APPLICATION-INDEPENDENT  DOCUMENT  STORAGE 
USING  A  GENERIC  MARKUP  LANGUAGE 


By 

TONY  VINCENT  HARRISON 


A  DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 

OF  THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT 

OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 

DOCTOR  OF  PHILOSOPHY 

UNIVERSITY  OF  FLORIDA 

1995 


Copyright  1995 

by 

Tony  V.  Harrison 


Dedicated  to  Mom, 
Dad,  and  Ewell. 


ACKNOWLEDGEMENTS 

I  would  like  to  thank  Dr.  Watson  for  the  opportunity  to 
work  with  him  on  this  research  project.  He  has  helped  me  in 
more  ways  than  he  knows.  This  technology  is  the  future  in 
information  management.  Thanks  again,  Dennis.  I  would  also 
like  to  thank  Dr.  Mishoe  for  taking  over  for  Dr.  Shoup  as  my 
major  professor.  He  has  also  been  a  great  help  to  me  as  a 
friend  and  student.  Special  thanks  to  Dr.  Beck  for  help  in 
understanding  the  database  and  retrieval  process  with  FAIRS, 
and  the  overall  CD-ROM  project.  Also,  I  wish  to  thank  Jeff 
Nelson,  David  Williams,  Ling  Li,  and  the  entire  FAIRS  staff 
for  their  input.  Thanks  to  Dr.  Peart  for  help  with  learning 
systems,  for  the  process  has  been  used  by  me  in  this  project 
and  others.  To  Dr.  Kilmer,  I  would  like  to  thank  him  for  his 
patience  throughout  this  long  process.  I  would  like  to  thank 
Mary  Cilley  for  her  work  with  the  FAST-WP  development  process 
and  editing,  Steve  Eissinger  with  WP2SGML,  and  Michael  Harper 
with  retrieval  software  conversion.  I  would  also  like  to 
thank  Drs.  Shoup  and  Isaacs,  whom  I  consider  as  my  mentors 
during  my  initial  years  here  at  the  University  of  Florida.  I 
would  also  like  to  say  thanks  to  all  the  faculty  and  staff  in 
the  Agricultural  and  Biological  Engineering  Department.  I 
have so many fond memories of them that are too numerous to
list. Finally, to Cecil and Mary Harrison, and Dalton and
Bernice  Harrison,  thanks  for  the  support. 


TABLE  OF  CONTENTS 

ACKNOWLEDGEMENTS

ABSTRACT

CHAPTERS

I  INTRODUCTION
    Justification
    Overall Objective
    Specific Objective Number One
    Specific Objective Number Two

II  REVIEW OF LITERATURE
    Computers as Information Tools
    Computer Storage
    Computer Recognition of Document Information and Format
    Generic vs. Specific Markup in a Computer Generated Document
    Standard Markup Languages
    History
    ODA/ODIF
    SGML as an International Standard
    Other SGML Standards or Reports Being Reviewed within the
        group that developed ISO 8879-1986 (Smith, 1989a)
    What SGML Is
    What SGML Does Not Mean
    How Does SGML Describe Structure?
    How Does SGML Work?
    An SGML Application
    Benefits of Converting Documents Into SGML Instances
    Hypertext Markup Language
    Commercial Document Processing Models
    Academic and Academic/Commercial Document Processing Models
    Abstract Document Model
    Andra Text Editor
    Ohio State's Chameleon Project
    COBATEF System
    FOAM
    FORMEX
    Integrated System for Complex Computer-Based Documents
    Text Editor Lara
    Maestro
    Mixed mode document processing system
    PEN
    TEXTNET
    Other Document Preparation Systems
    Hypertext and Hypermedia Systems

III  PROCEDURES
    Procedures For Specific Objective Number One
    Procedures For Specific Objective Number Two

IV  MODEL DEVELOPMENT PROCESS
    Determine Sample Set of FCES Publications
    Results of Document Analysis on Selected FCES Publications
    Model (DTD) Development and Selection

V  MODEL VERIFICATION
    FCES Publication Preparation and Conversion to SGML Format
    Identification of Model Elements in FCES Publications
    Conversion of FCES Publications Into SGML Format
    Conversion of FCES Instances Into Retrieval Format
    Candide: The Semantic Data Modelling Language For
        FAIRS DISC8 And DISC9
    FAIRS CD-ROM DISC8
    FAIRS CD-ROM DISC9
    Multimedia Viewer
    Guide
    Results of FCES Instances Converted to Retrieval Format
    Converting FCES Instances into FAIRS Retrieval Format
    Converting FCES Instances into Multimedia Viewer Format
    Converting FCES Instances into Guide Format
    Interpretation of Elements by Automated Retrieval Systems
    Possible Model Changes

VI  SUMMARY, CONCLUSIONS, AND RECOMMENDATIONS
    Summary
    Conclusions/Findings
    Observations
    Recommendations

GLOSSARY

APPENDICES

A  DEVELOPMENT OF AN SGML MODEL
B  PROCESS AND METHODOLOGY FOR DEVELOPING AN SGML APPLICATION
C  SELECTED PUBLICATIONS
D  TREE STRUCTURE
E  MODEL ELEMENTS, ATTRIBUTES, AND ENTITIES
F  STRUCTURE OF THE POPUP MENU USED FOR FAST-WP
G  RELATIONSHIP BETWEEN STYLES AND ELEMENTS
H  LITERATURE REVIEWED BUT NOT INCLUDED IN THESIS

REFERENCE LIST

BIOGRAPHICAL SKETCH


Abstract  of  Dissertation  Presented  to  the  Graduate  School 

of  the  University  of  Florida  in  Partial  Fulfillment  of  the 

Requirements  for  the  Degree  of  Doctor  of  Philosophy 

APPLICATION-INDEPENDENT  DOCUMENT  STORAGE 
USING  A  GENERIC  MARKUP  LANGUAGE 

By 

Tony  V.  Harrison 

May  1995 

Chairperson:  Dr.  J.  W.  Mishoe 

Major  Department:  Agricultural  Engineering 

Documents are generally stored in a proprietary software
format on a specific computer platform. Users whose computer
hardware and software differ from those used to develop the
documents must resort to manual processes, such as rekeying
the document, to access the information. The objective of this
research project was to store Florida Cooperative Extension
Service (FCES) documents in a format that allows any computer
hardware or software to use the information they contain.
Fifty documents were randomly selected from FAIRS' DISC8 (a
CD-ROM produced by the Florida Agricultural Information
Retrieval System (FAIRS) at the University of Florida (UF))
for model development. A document analysis of the FCES
publications produced a tree structure that identified
document elements and their hierarchical relationships. An
International Standard (ISO 8879-1986) known as Standard
Generalized Markup Language (SGML) was used to represent FCES
publication structure as a model. The model (Document Type
Definition (DTD)) was developed based on the tree structure
and the Association of American Publishers (AAP) Article
model. The FCES model consisted of elements in the AAP model
having the same definition as those in the tree structure,
together with unique elements required by UF. A commercial
parser verified that the model conformed to SGML rules and
syntax. Each FCES publication was tagged with Florida's
Authoring System Tools for WordPerfect (FAST-WP), whose styles
were developed from the structural properties in the model.
The tagged FCES publications were then converted to SGML
instances (tagged text files) based on the structure and
content of the model. Each instance was parsed with a
commercial parser to verify that the model was an adequate
representation of the structure of FCES publications. The
FCES model was then found to be application-independent after
converting the instances into FAIRS DISC8 and DISC9,
Multimedia Viewer, and Guide retrieval system formats for
on-screen display. The FCES model developed in this research
has been incorporated into a vertically integrated electronic
information system used by FAIRS to deliver FCES information
by CD-ROM and World Wide Web.


CHAPTER  I 
INTRODUCTION 


Justification 


A significant problem facing institutions is the
inability to retain and share knowledge in a form independent
of specific computer systems. This vast pool of knowledge,
although held in electronic documents, is distributed
primarily as printed material, even where electronic files may
be available from a publisher. For decades, federal documents
have been distributed on paper and on microfiche (The
Office of Technology Assessment, 1988). The cost of
distributing information as printed documents is a concern not
only of government agencies but also of private businesses.
For example, according to Robert Bennett, second vice
president at the Travelers Corporation, only 15 percent of the
$300 million budget for printing and publishing was for the
data center or printing center. The remaining 85 percent was
associated with the costs of weighing, storage, and the people
who sort mail in the printing and publishing process
(Francis, 1990). Bennett believes electronic publishing can
eliminate a significant portion of the printing and publishing
budget at Travelers Corporation.

The Florida Cooperative Extension Service (FCES) has not
been immune to this problem. The high cost of distributing
printed documents has led to efforts to distribute information
electronically through computer systems. FCES produces about
500 printed publications per year. Budget cuts and increasing
publishing costs moved FCES to place all publications on the
Florida Agricultural Information Retrieval System (FAIRS)
(Beck et al., 1994). FAIRS provides delivery of all FCES
documents in electronic form to growers, county extension
specialists, researchers, and homeowners. FAIRS uses
hypertext, full-text search, and browsing information
retrieval strategies to retrieve information from over 700
megabytes of data. The same word processing files used in the
printed publication process are delivered through the FAIRS
database. However, the conversion of electronic documents
into the FAIRS database format was a laborious process. A
text editor was used to divide each document into sections
based upon structure and content. Besides the labor
requirement, any revision of an electronic publication
required repeating the entire conversion process. These
restrictions gave the impetus to automate the entry of
structure and content from publications into electronic
information systems.

Electronic publishing typically begins with the use of
word processing software to develop documents. Word
processing programs are interactive, self-contained,
display-oriented text-formatting and editing programs used to
generate electronic documents (Rahtz, 1987). However, there
is often a limited ability to transfer these documents
directly into an electronic information system, because each
word processing program uses its own coding scheme to
represent a document's structure (hierarchical organization,
depicted by title and headings) and content (i.e., text,
figures, tables). Software systems for retrieval of
information, in contrast, typically use a database structure
for storage (Beck and Watson, 1992). These two
representations of document structure and content are
incompatible, producing transferability and portability
problems between the word processing files and electronic
information systems.

Retaining  document  structure  and  information  content 
during  changes  in  computer  hardware  and  software  is  critical 
for  long-term  successful  electronic  information  system 
development.  As  computer  hardware  and  software  constantly 
evolve  and  change,  knowledge  must  outlive  the  technology  used 
to  develop  and  initially  deliver  the  information.  Yuri 
Rubinsky  (1989)  says  that  "Although  one  cannot  anticipate 
future  publishing  requirements  and  technologies,  a  plan  can  be 
developed  to  recycle  information.  The  best  way  to  do  this  is 
to  store  information  in  a  standardized  way,  independent  of  any 
particular  technology  or  presentation  method  (page  9)." 


Overall  Objective 

The  overall  objective  of  this  research  project  is  to 
model  the  structure  and  represent  the  data  of  technical 
publications  in  an  electronic  form  that  is  independent  of  any 
computer  hardware  or  application. 

Specific  Objective  Number  One 

The  first  specific  objective  is  to  model  the  structure  of 
a  set  of  technical  publications,  and  represent  the  content  of 
the  publications  in  an  application-independent  form. 

Specific  Objective  Number  Two 

The  second  specific  objective  is  to  verify  the  model  by 
automating  and  testing  a  process  of  using  FCES  publications  in 
SGML  form  for  electronic  storage  and  delivery. 

The  remainder  of  this  dissertation  is  organized  into  five 
chapters.  Chapter  II  provides  an  overview  of  generic  markup 
description,  uses,  and  applications.  Also  included  is  a 
review  of  several  commercial  and  academic  formatting  models, 
and  hypertext  and  hypermedia  models  used  for  document 
conversion.  Chapter  III  describes  the  procedures  for 
accomplishing  the  overall  and  specific  objectives  in  Chapter 
I.  Chapter  IV  describes  the  model  development  process  for 
converting  FCES  publications  into  an  application-independent 
format. Chapter V summarizes the results of verifying the
model's application-independence by using retrieval software
to present document information. A summary of the author's
work, conclusions, and recommendations for future work in
modeling documents are in Chapter VI.


CHAPTER  II 
REVIEW OF LITERATURE


Computers  as  Information  Tools 

Computer  Storage 

Computers  store  information  in  electronic  form,  such  as 
the  customer  database  of  a  business  that  includes  each 
customer's  name,  address,  and  telephone  number.  Businesses 
can  develop  multiple  applications  such  as  billing,  sales 
promotions,  and  customer  service  when  information  is 
organized. Files, chapters, page numbers, and series of pages
are ways to organize both structural and contextual
information in electronic documents (Graphic Communications
Association, 1991). The content of electronic documents can
vary and may include text, images, graphics, spreadsheets, and
voice, which can impede transfer of information (Ansen, 1989).
However,  the  key  process  is  the  electronic  storage  and 
retrieval  of  the  information  for  human  consumption. 

Computer Recognition of Document Information and Format

Text is presented in two-dimensional form as information
for human consumption (Horak, 1984). For example, technical
writers can prepare electronic documents in a structure for
reader interpretation of the information. The structure aids
a reader in finding desired information. However, computers
only process the information while humans interpret it based
on the structure.

When  computers  are  used  for  document  processing,  they  are 
primarily  used  for  either  printing  or  viewing  a  document.  For 
example,  one  could  use  information  in  a  printed  software 
manual  describing  installation  procedures  for  a  particular 
software  package.  The  structure  of  the  manual  describes  the 
organization  of  the  information.  From  this  organization, 
subjects  are  recognized  based  on  formatting  of  areas  such  as 
titles,  where  sentences  and  words  end,  key  words,  and 
punctuation.  Coombs  et  al.  (1987)  suggest  that  those 
documents  with  both  accurate  and  descriptive  markup  can  be 
ported  from  one  computer  system  to  another. 

Descriptive  markup  for  each  subject  area  in  a  document 
allows  information  to  be  processed  in  many  ways.  However, 
while  documents  usually  have  no  explicit  structure,  database 
software  can  retrieve  information  by  particular  field  names, 
such  as  customer  name,  due  to  their  explicit  structure.  While 
the  computer  only  sees  the  information  as  an  electronic 
document,  humans  can  distinguish  between  the  information  and 
format  of  the  document.  Computers  with  this  ability  could 
treat  any  electronic  document  as  a  database  of  information. 
This would allow the computer to find information inside a
manual, handbook, or other document and use it for many
different applications.
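
The idea of treating a document as a database can be illustrated with a minimal sketch in Python. It is not part of the systems reviewed here: the tagged manual, its element names, and the function find_elements are all hypothetical, chosen only to show how descriptive markup lets software retrieve text by element name much as database software retrieves a record by field name.

```python
import re

# Hypothetical tagged manual; the element names are illustrative only.
TAGGED_MANUAL = """
<title>Installing the Package</title>
<step>Insert disk 1 into drive A.</step>
<step>Type INSTALL and press Enter.</step>
<note>Back up your files first.</note>
"""


def find_elements(text, name):
    """Return the content of every <name>...</name> element in text."""
    pattern = re.compile(
        r"<{0}>(.*?)</{0}>".format(re.escape(name)), re.DOTALL
    )
    return [content.strip() for content in pattern.findall(text)]


# Retrieve all installation steps, as a database would retrieve a field.
steps = find_elements(TAGGED_MANUAL, "step")
```

The same call with a different element name (for example "note") would serve a different application without any change to the stored document, which is the point made in the paragraph above.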

Desktop  publishing  software  and  some  word  processors 
enable  publishers  to  set  up  standard  print  styles  so  documents 
from  different  authors  can  be  incorporated  into  a  uniform 
series.  These  word  processors  also  render  the  final  printed 
text  easier  for  readers  to  interpret  by  making  the  structure 
more apparent (Wilson, 1991). Using a word processing
file to exchange information between different computer
hardware, however, raises several problems. First, the
electronic file contains codes to specify page size, fonts,
and position of text. Proper conversion or duplication of
print presentation on a different system requires the same
word processing software, printer, and perhaps soft fonts.
Second, computers allow the use of many different proprietary
software languages and packages. Incompatible computer
hardware or software formats require manual intervention by
users to use the information. This restricts the electronic
access to information that is inside these documents. Finally,
accessing the information in electronic documents from
different institutions and companies can be very difficult.
Valuable time has to be spent explaining the different
versions, and possibly helping make the document compatible
with another computer system. Upgrading hardware or software
systems could become less desirable and cause
incompatibilities when there are many documents in an existing
format. None of these concerns is new. Smith (1985) quotes a
January 1979 report by the UK National Computer Users' Forum
addressing these concerns. The report describes multiple
character sets as the biggest problem when interchanging data.
The primary cause of these multiple character sets is the
observance of poor standards.

Generic  vs.  Specific  Markup  in  a  Computer  Generated  Document 

The hierarchical organization of information in
electronic documents aids a reader's comprehension of the
material. In printed form, a reader can quickly scan a
document  and  understand  its  structure  by  viewing  the  different 
typefaces  and  sizes  for  various  levels  of  information.  For 
example,  a  large  bold  font  with  centered  text  could  represent 
the  highest  level  headings  and  a  left-justified  medium-size 
font  could  represent  lower  level  headings.  These 
typographical  conventions  provide  visual  cues  for  the  reader, 
and  are  important  for  video  display  of  an  electronic  document. 
For  example,  one  could  use  the  highest  level  headings  as  a 
table  of  contents,  with  a  hyperlink  to  text  for  each  heading. 
However,  the  font  and  text  justification  used  by  one  author 
for  first  level  headings  may  be  the  same  format  used  by 
another  author  for  third  level  headings.  These  structural 
ambiguities  prevent  modeling  a  set  of  documents  for  use  in 
electronic information systems.

Requiring authors to use the same word processor would
allow the production of a model describing the structure and
appearance of a set of documents. The model would define
specific font and positioning codes (or specific markup) to
describe each structural element (i.e., generic markup such as
title and author). The explicitness of the model allows
software  to  convert  from  the  file  format  of  the  word 
processing  program  to  a  format  electronic  information  systems 
could  understand.  However,  a  model  of  a  set  of  documents  must 
account  for  differences  in  specific  formatting  codes  and 
preferences  before  possible  use  in  an  electronic  information 
system. 

A  layer  of  abstraction  can  provide  a  way  to  account  for 
different  formatting  codes  and  preferences  in  an  electronic 
document.  Goldfarb  (1980)  describes  this  layer  of  abstraction 
as  markup.  The  markup  should  include  first  the  separation 
between  the  model  and  specific  formatting  codes,  and  then  the 
processing  functions  on  structural  elements  in  a  document. 
Generic  markup,  for  example,  identifying  a  first  level  heading 
as  Head  1  instead  of  with  a  large,  bold,  centered  font, 
provides  the  needed  layer  of  abstraction.  Many  word 
processing  programs  provide  generic  markup  with  a  feature 
commonly  called  style  sheets  or  styles.  Style  sheets  or 
styles separate structure and text (content) in a document
(Stein, 1991), allowing the formatting or processing of
subject areas as needed. Style codes used for printed
documents can serve as a way to introduce generic markup into
an electronic document (Cilley and Watson, 1992a and 1992b).
The  generic  markup  allows  further  processing  of  a  document  for 
database  storage  or  other  processing  such  as  CD-ROM  (Cilley  et 
al.,  1990).  For  example,  the  authors  of  an  electronic 
document  are  identified  with  an  "author"  style  code  instead  of 
specific  font  information.  At  the  time  of  printing  or 
display,  specific  markup  replaces  the  generic  codes  based  upon 
the  style. 
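
The replacement of generic codes by specific markup at output time can be sketched as follows. This is an illustration only, written in Python and not taken from FAST-WP or any word processor discussed here; the element names, the two style sheets, and the function apply_style are assumptions made for the example.

```python
# Generic markup: each piece of text carries only a structural name.
# The names and sample text are hypothetical.
DOCUMENT = [
    ("head1", "Irrigation Scheduling"),
    ("author", "J. Smith"),
    ("para", "Water use varies by crop."),
]

# Two hypothetical style sheets mapping generic names to specific markup.
PRINT_STYLE = {
    "head1": {"size": 18, "bold": True, "align": "center"},
    "author": {"size": 12, "bold": False, "align": "center"},
    "para": {"size": 10, "bold": False, "align": "left"},
}
SCREEN_STYLE = {
    "head1": {"size": 24, "bold": True, "align": "left"},
    "author": {"size": 14, "bold": False, "align": "left"},
    "para": {"size": 14, "bold": False, "align": "left"},
}


def apply_style(document, style_sheet):
    """Attach specific formatting to each generic element at output time."""
    return [(text, style_sheet[name]) for name, text in document]


# The stored document never changes; only the style sheet does.
printed = apply_style(DOCUMENT, PRINT_STYLE)
displayed = apply_style(DOCUMENT, SCREEN_STYLE)
```

Because the author is identified by the "author" code and not by a font, the same stored document yields a printed page and a screen display without being edited.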

A  system  using  generic  markup  was  initiated  at  the 
University of Florida for a CD-ROM project (CD-ROM
Implementation Group, 1990). The approach of the developers
was  to  add  generic  codes  to  word  processing  documents  as  an 
aid  to  knowledge  acquisition.  Adding  the  same  generic  codes 
to  the  same  subject  areas  in  multiple  documents  provides  a 
structure  that  can  be  modeled.  Development  of  the  document 
model  would  allow  the  conversion  of  similar  documents  into 
files  containing  both  structure  and  content.  This  separation 
gives  a  set  of  similar  documents  an  explicit  structure  that 
electronic  information  systems  can  use,  for  example,  to 
perform a query search or display document information
on-screen. Standard Generalized Markup Language (SGML) (ISO
8879-1986) provides the basis for structural model
development.


Standard Markup Languages

History 

Publishing companies recognized the need for a standard
specifying document architecture as early as the 1960s
(Rodgers, 1989). In September of 1967, William Tunnicliffe of
the Graphic Communications Association (GCA) suggested using
generic coding as descriptive tags to separate information
(content) from format (Goldfarb, 1990). In 1969, Charles
Goldfarb of IBM developed a Generalized Markup Language
(GML) (Goldfarb et al., 1970; Goldfarb, 1980) to integrate law
office information systems (Goldfarb, 1990). Goldfarb's work
served as the basis for the international standard, Standard
Generalized Markup Language (ISO 8879-1986). Another approach
for specifying document architecture is the Office Document
Architecture.

ODA/ODIF 

The  Office  Document  Architecture  and  Office  Document 
Interchange  Format  (ODA/ODIF)  is  an  approach  for  document 
interchange  currently  under  development  (US  Department  of 
Commerce,  1988).  Ansen  (1989)  describes  the  architecture  as 
a  set  of  standards  for  both  structuring  and  encoding  documents 
for  interchange  between  dissimilar  systems.  It  is  a  draft 
international standard (DIS 8613) under review by the National
Institute of Standards and Technology (NIST). Painter (1989)
suggests that ODA looks at document structure both logically
(e.g., sections) and as a layout view (i.e., a physical view
that decides how the content appears). Scheller (1988)
outlines the following differences between SGML and ODA:

1.  ODA  restricts  attributes  to  those  specified  by  the 
standards,  while  SGML  enables  the  definition  of  any 
desired  attribute. 

2.  SGML documents have no semantics defined in the standard,
while ODA documents contain semantics for document
representation.

3.  ODA  restricts  the  content  of  documents  to  the  standard, 
while  SGML  has  no  restrictions. 

4.  ODA documents are interpreted by machines and require
special input systems, while SGML has no such
restrictions.

5.  Interchange  of  ODA  documents  between  different  personnel 
needs  no  agreement,  while  SGML  documents  can  only  be 
interpreted  within  special  applications  environments. 

6.  ODA documents are represented by a special formatter or
the semantics of the layout description in a document
formatting language. The formatter and document class
describe the representation of SGML documents.

Scheller (1988) provides the most crucial difference
between ODA and SGML as "the fact that ODA facilitates the
interchange of documents with restricted functionality between
any partner in an open computer network, whereas SGML
documents can only be interchanged within clearly defined
applications areas but are not subject to restrictions with
respect to functionality (page 142)." He also says that a
number of representational possibilities, content types,
context-dependent layout descriptions, and the automatic
generation of both tables of contents and references are
absent when using ODA in the technical, scientific, and
publishing areas.

SGML  as  an  International  Standard 

In the early 1980s, the International Organization for
Standardization (ISO) began preparing standards to allow
transfer of multiple document types over varied computer
systems (Bryan, 1988). In December of 1986, ISO issued its
standard for document representation known as the Standard
Generalized Markup Language. The standards committees that
deal with the SGML standard areas are as follows:

1)  Joint Technical Committee 1 (JTC1) for Information
Processing.

2)  Subcommittee 18 (SC18) for Text and Office
Systems.

3)  Working Group 8 (WG8) project 15 for Computer
Languages for Processing Text.

The project editor of the standard is Charles Goldfarb of IBM
Corporation in the United States.

Other SGML Standards or Reports Being Reviewed within the
group that developed ISO 8879-1986 (Smith, 1989a)


1)  Document Style and Semantic Specification Language
(DSSSL - ISO 10179) - This standard provides a
language to describe the translation of SGML markup
in a document to a specific format. The simplest
application would be a style sheet. This goes in a
specification separate from the DTD. It allows the
exchange of an SGML file among different systems,
and has specific DSSSL representations for the
document.

2)  Font Information Interchange (ISO/DIS 9541) -
Problems  occur  when  exchanging  a  page  or  document 
in  DSSSL  form  from  one  computer  to  another  when 
printers  are  different.  For  example,  a  Helvetica 
14-point  font  may  look  different  from  one  printer 
to  another.  Thus,  there  has  to  be  a  standard  way 
of  describing  the  fonts  for  printed  pages  of  a 
document  to  be  identical  from  system  to  system. 

3)  Guidelines  for  SGML  Syntax-Directed  Systems  -  This 
technical  report  specifies  a  series  of  guidelines 
for  the  capabilities  of  an  SGML  syntax-directed 
editing  system. 

4)  Retroactive Conversion - This technical report
deals with insertion of tags into existing text,
whether in databases or word processing files that
do not understand SGML. It looks at SGML features
that can reduce the markup needed within a
document.

5)  SGML  Document  Interchange  Format  (SDIF  -  ISO  9069) 
-  This  ISO  standard  allows  the  interchange  of  an 
SGML  document  by  means  of  open  systems 
interconnection  (OSI)  techniques. 

6)  Standard Page Description Language (SPDL - ISO
10180) - Xerox and Adobe are developing this as a
standard PostScript language. It would provide a
way to exchange finished pages of an electronic
document between computer systems in a standard
way. For example, an application can send a
document page from one computer to another for
identical printing of the page.

7)  Techniques for using SGML (ISO/DTR 9573) - These
techniques describe the design of document type
definitions, including mathematics. Criticisms
outlined by Smith (1989a), which have been directed
to this area, include not taking advantage of
database publishing and mathematics.

There are several ongoing projects for SGML conformance
(Graphic Communications Association, 1991). First, there is
an initiative from the executive committee of the Graphic
Communications Association (GCA) to create a worldwide
laboratory to test SGML software for conformance to
international standards. Second, there is a project for the
development of the binary encoding of SGML (SGML-B). It
entails a one-to-one translation between an SGML file and an
SGML-B file to enable quick access time by a computer. This
is important for CD-ROM production. Currently, sequential
coding requires building indexes for fast access to
information. SGML-B will allow CD-ROM production personnel to
place a binary file directly on a CD.

Third, the Hypermedia/Time-based Structuring Language
(HyTime), a solution for hypertext, was first published by the
ISO late in 1992 (ISO/IEC 10744:1992). HyTime was added to
provide ways for different information to coexist and work
together in an ever-changing environment. The combination of
HyTime and SGML provides greater information management among
different media such as textual information, audio, animation,
and digital information. As of this writing (August, 1994),
there are no fully HyTime-conforming applications in the
marketplace.

What  SGML  Is 

SGML  is  a  standard  for  full-text  database  publishing 
(Smith,  1986b)  that  defines  the  character  set  for  processing 
information  safely  over  any  system  (SoftQuad,  1991) .  It 
provides  a  descriptive  language  for  modeling  document 
architecture  in  specific  syntax.  In  SGML  terminology,  a 
structural  model  of  a  set  of  documents  is  a  document  type 
definition (DTD). Documents converted to an SGML format based
on  the  structural  model  are  instances. 

For  model  development,  SGML  is  the  standard  for  defining 
the  element  names  (i.e.,  subject  areas),  and  their  order, 
location,  frequency,  and  relationships  within  a  document.  An 
SGML  model  explicitly  follows  the  structure  of  a  set  of 
documents.  For  example,  an  SGML  model  could  require  the 
generic  name  "chapter"  to  be  placed  at  each  chapter  heading  in 
a  set  of  documents. 
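
As a rough illustration of what such a model constrains, the
sketch below encodes the allowed order and frequency of child
elements as a regular expression over element names. The element
names (title, author, chapter) and the helper conforms are
hypothetical, and a real SGML model is written in DTD syntax
rather than Python; this is only a sketch of the idea.

```python
import re

# Hypothetical content model for a "book" element: exactly one title,
# zero or more authors, then one or more chapters, in that order.
# A DTD would express this roughly as (title, author*, chapter+).
BOOK_MODEL = re.compile(r"(title )(author )*(chapter )+")


def conforms(children):
    """Return True if the sequence of child element names fits the model."""
    # Append a space to every name so each name is one regex token.
    encoded = "".join(name + " " for name in children)
    return BOOK_MODEL.fullmatch(encoded) is not None


print(conforms(["title", "author", "chapter", "chapter"]))  # True
print(conforms(["chapter", "title"]))                        # False
```

Because the chapter element is repeatable, the same model accepts
a document with 30 chapters and one with 100 chapters.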

Modeling  the  structure  of  a  set  of  documents  allows 
conversion  of  documents  into  a  standard  file  format 
interpreted  by  most  electronic  information  systems.  ASCII, 
the  American  Standard  Code  for  Information  Interchange,  is  a 
standard  character  set  (i.e.,  file  format)  that  SGML  can  use 
to  represent  DTDs  (models)  and  instances.  For  example, 
software  could  convert  word  processing  files  with  generic 
markup  into  ASCII  format  (instances)  based  on  the  model  (DTD) . 
Electronic  information  systems  then  translate  the  instances 
into  their  respective  format  for  video  display. 

What  SGML  Does  Not  Mean 

First,  SGML  is  not  a  tag  set  or  programming  language.  It 
does  not  require  specific  elements  in  a  model  or  provide  a  set 
of  rules  to  mark  up  a  document.  Second,  SGML  does  not  define 
the  meaning  of  structural  elements  within  the  document. 
Rather,  SGML  provides  the  document  structure  required  by 
electronic information systems to access the information
within a document. Third, an SGML file does not describe how
to process an element. An element is open to any processing
application. The SGML application, a program that uses the
tagged SGML file, decides how each element will be processed.
Nor does SGML check the data or content of an element, for
there is no way to verify that the information between the
tags is appropriate for that type of element. For example, an
entire document can be placed under the element "title" in the
SGML file (instance).

How  Does  SGML  Describe  Structure? 

SGML  defines  document  structure  formally  so  a  computer 
application can use the information. The structure
should not be too restrictive; it must be flexible enough to
represent several types of documents. An example of being too
restrictive would be to define one structure for a set of
documents that has 100 chapters, and another structure for
documents with 30 chapters. Allowing a repeatable chapter
element in the structure lets both document classes be
defined in the same model.

An  SGML  model  (DTD)  normally  precedes  each  tagged  text 
file  (instance).  This  enables  computer  software  to  learn  the 
structural  properties  of  the  document  that  follows.  An 
example  might  be  a  structural  model  representing  all 
automobile  technical  manuals  of  a  particular  corporation.  The 
document type definition (DTD) is the model that defines the
structure  of  the  automobile  technical  manuals.  The  DTD  would 
appear  before  any  instance  representing  a  tagged  automobile 
technical  manual. 
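
The arrangement of a DTD ahead of its instance can be sketched
as follows. The sample file, the element names, and the helper
split_prolog are hypothetical; the sketch assumes the DTD is
given inside a DOCTYPE declaration in square brackets and that
the characters "]>" do not occur inside the declarations.

```python
# Hypothetical file: a DOCTYPE declaration holding the model (DTD),
# followed by the tagged text (instance) for one technical manual.
SAMPLE = """<!DOCTYPE manual [
<!ELEMENT manual  - - (title, section+)>
<!ELEMENT title   - - (#PCDATA)>
<!ELEMENT section - - (#PCDATA)>
]>
<manual><title>Brake Service</title><section>Pads</section></manual>"""


def split_prolog(text):
    """Split a file into its DTD declarations and its instance."""
    start = text.index("<!DOCTYPE")
    open_bracket = text.index("[", start)
    # Simplification: assumes "]>" only appears at the end of the DTD.
    close = text.index("]>", open_bracket)
    dtd = text[open_bracket + 1:close].strip()
    instance = text[close + 2:].strip()
    return dtd, instance


dtd, instance = split_prolog(SAMPLE)
print(dtd.count("<!ELEMENT"))         # 3
print(instance.startswith("<manual>"))  # True
```

Software that reads the file in this order has the structural
rules in hand before it meets the first tag of the instance.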

How  Does  SGML  Work? 

An SGML model provides the names (i.e., generic
identifiers), location, order, and frequency of elements in a
document. When needed, attributes provide a greater
description of elements in the document structure. For
example, the sex of element "author" could be either male or
female. The element "author" may have an attribute named
"sex," which can have attribute values of either "male" or
"female."

Thus an element may contain data, other elements, or both.
What an element may contain is specified by its content model.
For example, a chapter may contain other elements such as
paragraphs, titles, and sections. There is no limit or
restriction as to what type of information can go inside an
element. For example, the data within an element named
"videotape" could be a VHS-coded tape.
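
The attribute and content descriptions above can be sketched as
a small lookup table. The table layout, the "#DATA" marker for
text content, and the helper check are assumptions made for
illustration only; they are not SGML syntax.

```python
# Hypothetical declarations: each element lists the attribute values it
# allows and the children it may contain ("#DATA" stands for text).
DECLARATIONS = {
    "chapter": {
        "attributes": {},
        "children": {"title", "section", "paragraph"},
    },
    "author": {
        "attributes": {"sex": {"male", "female"}},
        "children": {"#DATA"},
    },
}


def check(element, attrs, children):
    """Return a list of problems found for one element occurrence."""
    decl = DECLARATIONS[element]
    problems = []
    for name, value in attrs.items():
        allowed = decl["attributes"].get(name)
        if allowed is None:
            problems.append(f"{element}: undeclared attribute {name}")
        elif value not in allowed:
            problems.append(f"{element}: {name}={value} not allowed")
    for child in children:
        if child not in decl["children"]:
            problems.append(f"{element}: {child} not allowed here")
    return problems


print(check("author", {"sex": "female"}, ["#DATA"]))   # []
print(check("author", {"sex": "unknown"}, ["#DATA"]))  # one problem
print(check("chapter", {}, ["title", "paragraph"]))    # []
```

Note that the check looks only at names and declared values; it
says nothing about whether the text inside an element is sensible.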

Conceptually, SGML processing always follows the same
pattern. An SGML processor reads an SGML tagged text file
(instance) and performs operations based on the generic markup
or coding. The two pieces needed are a brain (parser) and what
may be called a representer (Graphic Communications
Association, 1991). The parser reads and understands the
instance, then passes the information to the representer. The
representer can then, for example, convert the information
(instance) into a format suitable for on-screen display, or
render the information for presentation in a publishing
system.

An  SGML  instance  is  a  database  of  information  that  does 
not  do  anything  by  itself.  The  parser  reads  and  understands 
information  in  the  document  by  following  the  SGML  model  (DTD) . 
The  parser  reads  the  DTD  to  distinguish  between  information 
and  generic  markup  (i.e.,  element  names)  in  the  instance.  The 
parser  uses  the  DTD  to  recognize  each  element  (generic  name) 
in the instance. The instance is considered validated when
the parser verifies that each element appears in the location,
order, and frequency that the DTD allows. An electronic
information system that can read SGML instances of a specific
SGML model may then use the file as directed. Storage and
retrieval systems without this capability can instead use a
representer to prepare the information in an instance for
their respective use.

A  representer  prepares  information  in  the  instance  for 
use  by  software  such  as  electronic  information  systems.  Thus, 
the  representer  creates  a  particular  representation  for  the 
elements.  There  are  many  ways  to  develop  a  representer,  which 
can  be  very  software  intensive.  The  application  that  will 
finally  use  the  information  after  it  has  gone  through  the 
representer needs instructions that it can understand. The
representer will need a list of elements, their meaning (e.g.,
<t> - title), their needs, and the specific instructions to
give to an application when it receives the elements. The
representer expects the proper input, because the parser will
verify that the SGML file corresponds to the DTD. Examples of
a representer include a converter for formatting codes, a
database loader, and a query search.
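
The division of labor between parser and representer can be
sketched as below. This is a minimal illustration, not a
conforming SGML parser: the element list stands in for a DTD,
only simple lowercase tags without attributes are handled, and
the names KNOWN_ELEMENTS, PLAIN_TEXT_RULES, parse, and represent
are all hypothetical.

```python
import re

# Hypothetical element list the parser accepts (stands in for a DTD).
KNOWN_ELEMENTS = {"doc", "t", "p"}

# Hypothetical representer table: element -> (text before, text after).
PLAIN_TEXT_RULES = {
    "doc": ("", ""),
    "t": ("== ", " ==\n"),
    "p": ("", "\n"),
}

# Matches a start tag, an end tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([a-z]+)>|([^<]+)")


def parse(instance):
    """Yield ("start", name), ("data", text), and ("end", name) events.

    Raises ValueError for unknown elements or badly nested tags.
    """
    open_elements = []
    for closing, name, data in TOKEN.findall(instance):
        if data:
            if data.strip():
                yield ("data", data.strip())
            continue
        if name not in KNOWN_ELEMENTS:
            raise ValueError(f"unknown element: {name}")
        if closing:
            if not open_elements or open_elements.pop() != name:
                raise ValueError(f"unexpected end tag: {name}")
            yield ("end", name)
        else:
            open_elements.append(name)
            yield ("start", name)
    if open_elements:
        raise ValueError(f"unclosed element: {open_elements[-1]}")


def represent(events, rules):
    """Turn parser events into output text using one rule table."""
    out = []
    for kind, value in events:
        if kind == "start":
            out.append(rules[value][0])
        elif kind == "end":
            out.append(rules[value][1])
        else:
            out.append(value)
    return "".join(out)


sample = "<doc><t>Fire Ants</t><p>Control methods vary.</p></doc>"
print(represent(parse(sample), PLAIN_TEXT_RULES))
```

Swapping in a different rule table would yield a different
representation of the same instance, which is the sense in which
the instance stays independent of any one application.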

An  SGML  Application 

The  development  of  an  SGML  document  model  often  requires 
that  goals  for  the  SGML  application  be  set  and  a  working  group 
selected.  A  working  group  is  a  select  group  of  individuals 
involved  in  the  development  of  an  SGML  application  (model) . 
Given  a  subset  of  a  class  of  documents  such  as  fact  sheets 
(Figure  A-1,  Appendix  A) ,  the  working  group  breaks  down  the 
documents  into  pieces  (Figure  A-2,  Appendix  A)  and  develops  a 
tree  structure  representation  (Figure  A-3,  Appendix  A)  of  the 
document model. The model (DTD) (Figure A-4, Appendix A), a
vocabulary representation of the document structure, is
written upon completion of the document analysis and validated
for conformance to ISO 8879-1986 standards and SGML syntax.
Tagged documents (Figure A-5, Appendix A), known as instances,
are validated to ensure conformance (i.e., no errors) with the
model (DTD). A set of validated instances provides the
explicit structure required by electronic information systems.

SGML applications can be either narrow or broad in scope.
The  Electronic  Manuscript  Project  of  the  Association  of 
American  Publishers  (AAP)  (1987)  and  Computer-aided 
Acquisition  and  Logistic  Support  (CALS)  for  the  Department  of 
Defense are two broad SGML applications. AAP developed SGML
applications for creating books, journals, and articles between
1983 and 1987. The AAP SGML application standard enjoys wide
support in publishing industries such as CD-ROM, and has been
adopted as an ANSI application standard (Z39.59). Cover
(1992)  and  <TAG>  (SGML  Associates,  Inc.,  1992)  list  seven 
document  models  (DTDs)  that  are  available  on  the  Internet. 
These  include  the  Text  Encoding  Initiative  (TEI)  DTDs,  MAJOUR 
(Modular  Application  for  Journals)  DTDs  based  upon  the  AAP 
Article  DTD,  a  HyTime  DTD,  public  DTDs  available  from  Exeter, 
the  CALS-BBS  forum,  DTDs  supporting  the  AAP/EPSIG  manuscript 
standard,  and  the  "Information  Architecture"  working  group  DTD 
of  the  OSF  Documentation  Special  Interest  Group. 

Benefits  of  Converting  Documents  Into  SGML  Instances 

Time savings are a major benefit of using generic markup
because the document does not have to be coded twice (Graphic
Communications Association, 1991). The publication process is
also shortened because a standard approach to the preparation
of electronic documents requires no further rekeying or
proofreading of text (Smith, 1986a). Smith (1986a) provides an
example of personnel initially using word processing macros to
apply specific markup to document information. This work was
later reduced to recognizing, then tagging, the structural
elements with both generic and specific markup. The change
made the job easier and faster and required less specialized
labor, resulting in significant cost savings.

SGML  also  provides  benefits  for  the  information  retrieval 
process. Previously, documents were stored either in whole
text form such as printed material or as electronic files with
"specific markup." SGML aids the information retrieval
process through query efficiency and automatic hyperlinking.
Query  efficiency  is  improved  by  selecting  specific  headings, 
sections,  or  topics  instead  of  an  entire  list  of  occurrences 
of  a  particular  subject.  The  hierarchical  structure  of  SGML 
documents  allows  hyperlinks  to  link  together  parts  of 
documents  such  as  words,  titles  and  sections. 
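
The gain in query efficiency can be sketched with a search that
is restricted to one kind of element. The sample instance, its
element names, and the helper search_element are hypothetical,
and the sketch assumes elements of the same name are not nested
inside one another.

```python
import re

# Hypothetical tagged fact sheet (instance).
SAMPLE = (
    "<doc><title>Citrus Irrigation</title>"
    "<p>Irrigation of citrus and tomato.</p>"
    "<p>Tomato pests.</p></doc>"
)


def search_element(instance, element, term):
    """Return the contents of each <element> that mentions term."""
    name = re.escape(element)
    pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    return [
        content
        for content in pattern.findall(instance)
        if term.lower() in content.lower()
    ]


print(search_element(SAMPLE, "title", "citrus"))  # ['Citrus Irrigation']
print(search_element(SAMPLE, "title", "tomato"))  # []
```

A whole-text search for "tomato" would report this document,
whereas the title-only query correctly passes over it.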

Hypertext Markup Language

The  Hypertext  Markup  Language  (HTML)  is  based  on  SGML, 
and  is  used  to  describe  the  general  structure  of  publications 
(Lemay,  1995) .  The  structural  components  of  a  publication  in 
ASCII  format  are  labeled  using  tags  defined  by  HTML.  Web 
browsers such as Mosaic (Pfaffenberger, 1994), which is
supported by the National Center for Supercomputing
Applications at the University of Illinois, provide the
network functions to retrieve HTML documents over the
Internet and World Wide Web (WWW). The browser then reads the
HTML information and formats the text and images on the
screen. The World Wide Web initiative began in 1990 as
a cooperative effort based at CERN, the European
Particle Physics Laboratory in Switzerland (Lemay, 1995).
However, the tag selection in HTML is very limited. Currently,
HTML Level One handles headings, paragraphs, images, and a few
lists. Two other levels of HTML have been proposed. HTML
Level Two is similar to HTML Level One, but has additional
features to support interactive forms that can provide
different options based on a reader's input. HTML Level
Three, often called HTML+, will include elements for centered
and right-aligned text, tables, mathematical equations, and
the alignment of text and images next to each other.

Commercial  Document  Processing  Models 

Document  preparation  and  editing  involve  defining  the 
structure  and  content  of  documents,  while  formatting  is 
concerned  with  the  actual  physical  layout  of  a  document  for 
both  hardcopy  and  softcopy  (Furuta  et  al.,  1982).  Formatting 
documents  using  generic  and  specific  markup  can  serve  the  dual 
purpose  of  printed  and  on-screen  display.  Commercial  products 
vary  widely  in  applications,  from  text  preparation  to 
conversion  of  documents  to  SGML  instances.  Table  2-1  provides 
brief  descriptions  of  several  commercial  products. 

Academic and Academic/Commercial Document Processing Models

Academic  and  commercial  products  resulting  from  academic 
research  have  been  developed  for  various  stages  in  document 
processing.  A  review  of  several  of  these  models  is  as 
follows. 

Abstract  Document  Model 

Kimura (1986) presents this document processing system as
an  interactive  document  editor  based  on  an  expressive  document 
model  for  paper  and  electronic  documents.  Earlier  papers  have 
presented  concepts  of  this  abstract  document  model  in  more 
detail (Shaw, 1980) (Furuta et al., 1982) (Kimura and Shaw,
1984) (Kimura, 1984). The document processing system is based
on the notions of abstract and concrete objects, the
hierarchical composition of both ordered and unordered
objects, component sharing, and reference links (Kimura,
1984). Kimura (1984) also classifies objects into
textual, tabular, mathematical, or pictorial classes. Written
in  C,  a  prototype  of  the  system  has  been  in  operation  since 
the  fall  of  1983.  The  system  consists  of  three  major  software 
modules.  The  graphical  abstract  document  editor  (ADE) 
integrates  the  abstract  object  module  (AOM)  and  window  object 
module  (WOM) ,  producing  the  prototype  system.  Abstract  object 
classes  were  also  developed  to  write  and  view  technical 
documents. The uniqueness of the system lies in the model,
the interrelationships between windows, the unique views
generated, and the allowance of structural editing within the
document using specific commands (Kimura, 1986).

Andra  Text  Editor 

Andra  (Gutknecht  and  Winiger,  1984)  is  a  modern  text 
editor and formatter for the personal computer Lilith (Wirth,
1981). Professor N. Wirth developed Lilith at the Institute
for  Informatik  of  the  ETH  Zurich  from  1977  to  1980.  Andra 
consists  of  three  major  parts:  input  manager,  document 
manager,  and  display  manager.  The  input  manager  interprets 
user  input,  then  translates  commands  to  procedure  calls.  The 
document  manager  maintains  the  representation  of  documents. 
The  display  manager  continuously  shows  part  of  the  documents 
being  edited  on  the  screen. 

Ohio State's Chameleon Project

The  Chameleon  translation  software  architecture  (Mamrak 
et  al.,  1988a,  1988b,  1988c,  and  1988d)  (Nicholas  and  Mamrak, 
1988)  was  renamed  Integrated  Chameleon  Architecture  (ICA) 
(Mamrak  et  al.,  November  1990).  The  project  studied  the 
different  ways  that  data  can  be  represented  (e.g.,  different 
word  processors)  and  translated  to  a  desired  coding  scheme 
(e.g., SGML format). ICA addresses the broad variety of data
representations by the construction and use of data
translators (Mamrak et al., May 1987) (Mamrak et al.,
September 1989). Both the design and implementation of ICA
toolsets occurred over a five-year period (Barnes, July 1990)
(Kaelbling, 1987) (Mamrak et al., September 1989) (Nicholas,
1988) (O'Connell, 1990) (Share, 1988a and 1988b). The goal of
the  developers  was  to  design  and  implement  a  code-generating, 
user-friendly,  data-translation  architecture  that  would  handle 
translations  for  data  representations  from  a  selected  subset 
of  data  objects. 

COBATEF  System 

The  COBATEF  system  is  a  context-based  text  formatting 
system  (Peels  et  al.,  1985)  consisting  of  both  hardware  and 
software  areas  of  implementation.  An  automatic  text-element 
recognition  mechanism  takes  advantage  of  the  implicit 
structure  of  text,  opening  the  way  for  a  fully-automatic  text- 
processing  system.  The  COBATEF  system  can  recognize  text 
elements by their context in two ways. Elements can be
recognized either by scanning the document for markup or by a
processing procedure that derives document structure from the
content. COBATEF's software package converts the document into
its
logical  structure.  It  has  a  horizontal  formatter  text 
identification  and  vertical  formatter  that  produces  device- 
independent  print  files.  Several  papers  are  available  that 
give  the  hardware  developments  on  the  project  (Janssen,  et 
al.,  1985)  (Nijland  and  Peels,  1985)  (Peels,  1984). 

FOAM

FOAM (Ganzinger and Willmertinger, 1985), which stands for
Formatting and Meta-formatting, is a text formatting system
developed specifically to run on available microcomputer
systems. FOAM supports meta (description) and text levels
of formatting. Descriptions of
text  and  document  classes  (at  meta  level)  are  input  to  a  macro 
processor,  which  draws  specific  formatting  styles  from  a 
database  of  macro  definitions  and  generates  a  specific 
formatter  instance.  At  the  text  level,  the  resultant 
formatter  accepts  textual  input  of  the  described  document 
class  and  produces  a  formatted  document  based  on  the 
formatting  styles  created  at  the  meta  level. 
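
The two levels can be illustrated loosely as a function that
builds a formatter from stored styles. This is not FOAM's actual
implementation; the style table, the document classes, and the
function names are hypothetical, and Python's standard textwrap
module stands in for the generated formatter.

```python
import textwrap

# Hypothetical database of formatting styles, one per document class.
STYLES = {
    "memo": {"width": 30, "indent": ""},
    "report": {"width": 50, "indent": "    "},
}


def make_formatter(document_class):
    """Meta level: build a formatter for one document class."""
    style = STYLES[document_class]

    def format_text(paragraph):
        """Text level: format input using the styles chosen above."""
        return textwrap.fill(
            paragraph,
            width=style["width"],
            initial_indent=style["indent"],
        )

    return format_text


format_memo = make_formatter("memo")
print(format_memo("Generic markup separates structure from layout. " * 2))
```

The formatter returned for one document class can then be
applied to any number of texts of that class.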

FORMEX 

FORMEX  (Guittet,  1985)  (i.e.,  the  formalized  exchange  of 
electronic  publishing)  was  developed  to  confront  problems  with 
recovering  varied  formats  of  electronic  data  and  text  within 
the  European  Community's  Office  for  Official  Publications 
(OP). The Project Management Department in the OP developed
FORMEX  as  a  way  to  store  publications  in  a  computer-readable 
format  for  information  interchange  between  multiple  authors, 
printers,  and  computer  systems. 

FORMEX  unified  two  ways  of  interchanging  electronic 
information.  The  first  approach  was  adapted  from  the  common 
communication format (CCF) developed by UNESCO (1984) and
based on the Format for Bibliographic Information Interchange
on Magnetic Tapes (ISO 2709). Output software extracts data
from  files  stored  in  a  mainframe  database.  A  formatter  then 
produces  and  places  the  information  into  a  file  according  to 
the  specifications  of  the  CCF.  The  resultant  CCF  file  is  then 
validated  with  a  CCF  parser.  Automatic  and  manual  editing, 
and  pertinent  SGML  information,  are  added  to  enable  the  file 
to be upgraded to a FORMEX file. However, FORMEX is not
ideally suited to the transfer of electronic documents
consisting of textual information.

The  second  approach  met  ISO  8879-1986  standards  for  text 
preparation  and  interchange.  It  allowed  the  creation  and 
interpretation of electronic documents by humans and computers.
Typically a document is produced by an author on a word
processor. The document conforms to SGML standards by
presenting the information in an explicit format. A formatter
converts the document into SGML format based on ISO 8879-1986
standards and the DTD. An SGML parser validates the SGML file
and the document information is structured as specified by
CCF for upgrading to a FORMEX file.

Integrated  System  for  Complex  Computer-Based  Documents 

Feiner et al. (1981, 1982) developed a system of
different programs that drew pictures, composed pages, and
graphically specified and presented the contents of pages of
computer-based documents. The documents are structured as
directed graphs made up of nodes (pages). These
pages can be nested in chapters to format documents such as
books. The directed-graph structure was developed from
previous  research  on  the  Hypertext  Editing  System  (Carmody  et 
al.,  1969)  and  FRESS  (van  Dam  and  Rice,  1971),  also  known  as 
File  Retrieval  and  Editing  System.  These  text  processing 
systems  have  information  structuring  and  retrieval 
capabilities  and  are  useful  for  document  preparation 
(Strandberg et al., 1976). The system is a series of programs
for modifying and presenting a document. The processes
followed for electronic documents include picture layout,
document layout, and document presentation.

Text  Editor  Lara 

Lara  (Gutknecht,  1985)  is  a  text  editor  developed  for  the 
Lilith  workstation  (Wirth,  1981) .  It  succeeds  the  Andra 
system  (Gutknecht  and  Winiger,  1984)  and  does  not  depend  upon 
a  style  file.  Rather  than  applying  style  elements  to  document 
structural  areas,  Lara  copies  attributes  from  one  place  on  the 
computer  display  to  another  in  the  same  and  other  documents. 
This  allows  a  particular  group  of  documents  to  achieve  the 
same  format.  Consistency  in  the  format  of  displayed  text 
allows  a  connection  with  the  internal  data  structure.  Thus 
the  internal  data  structure  is  not  dependent  on  the  editing 
process,  but  is  inferred  from  characteristics  of  the  currently 
displayed  text. 

Maestro

Maestro,  also  known  as  Management  Environment  for 
Structured  Text  Retrieval  and  Organization,  is  a  model 
consisting  of  tools  to  take  advantage  of  structural  knowledge 
in bibliographic data (Macleod, 1990). Maestro was developed
using both conceptual modeling and an object-oriented
philosophy. It consists of a definitional facility and query
language  for  handling  queries  and  updates.  The  definitional 
facility  analyzes  and  constructs  a  second  document  that 
contains  a  structural  representation  of  the  original  document. 
XGML  (Exoterica  Inc.,  1987)  was  the  commercial  compiler  used 
in  this  process.  A  document  developed  with  the  definitional 
facility  consists  of  content,  attributes,  and  structure  of  the 
text.  The  query  language  was  developed  from  previous  work  by 
the  author  (Macleod  and  Reuber,  1987)  on  document-retrieval 
systems.  The  main  objective  of  this  process  was  to  develop  a 
language  that  could  naturally  handle  all  text  processing 
applications. 

Mixed Mode Document Processing System

Yamada  et  al.  (1987)  present  this  system  as  an  extended 
document  processing  model  for  constructing  mixed  mode 
documents.  The  system  contains  both  structuring  and  layout 
editing  processes.  The  structuring  process  entails  scanning 
a  printed  document  and  automatically  creating  a  structured 
document. A structuring algorithm is used to hierarchically
separate regions such as characters, figures, and tables. The
layout  editing  process  consists  of  both  content  and  structure 
editors.  An  interactive  pattern  recognition  process  was  also 
proposed.  The  method  consists  of  both  machine-dependent  and 
human-dependent  processes  that  reduce  the  psychological  load 
on  a  human  operator. 

PEN 

PEN (Allen et al., 1981), a hierarchical document editor,
is a computerized manuscript preparation system for documents
containing significant mathematical notation. The interactive
formatter provides visual feedback as the author is typing the
document. PEN's unique contribution is that it provides
notation to simplify mathematical text entry.

TEXTNET 

TEXTNET  (Trigg  and  Weiser,  1986)  is  currently  a  local 
level  network-based  approach  for  structuring  text.  The 
research  studied  different  text  organization  strategies  and 
their  effects  on  the  scientific  community.  However,  it  has 
seen  use  as  an  aid  for  text  manipulation.  TEXTNET  integrates 
into  one  approach  a  local  network  of  both  chunks  of  text  in  a 
document  and  linked  documents  consisting  of  on-line  literature 
of  scientific  nature.  The  text  is  stored  in  a  way  to  make  its 
underlying  structure  explicit.  This  provides  a  way  for 
meaning  to  be  extracted  from  relationships  between  chunks  of 
text. For example, one chunk can support the results obtained
in  another  chunk,  whether  in  the  same  or  different  documents. 

Other  Document  Preparation  Systems 

Furuta  (1989)  examined  the  various  capabilities, 
features,  and  structure  of  other  document  preparation  systems 
such as TeX (Knuth, 1984), LaTeX (Lamport, 1985), troff
(Kernighan, 1981), Scribe (Reid, 1980) (Unilogic, Ltd., 1984),
Interleaf (Ilson, 1988), MacWrite (Apple Computer Inc., 1984),
Xerox's Tioga (Teitelman, 1984 and 1985), MIT's Etude (Hammer
et al., 1981) (Ilson, 1980), and IBM's Janus (Chamberlin et
al., 1981) (Chamberlin et al., 1982).

Lee  and  Malone  (1988)  explored  solutions  for  computer- 
based  office  system  problems  such  as  user  communication  with 
different  templates  and  document  interchange  between  different 
word  processing  programs.  The  solutions  were  based  on  the 
development  of  extensions  to  the  Information  Lens  System 
(Malone et al., 1987) (Malone et al., May 1987).

Appendix  H  provides  other  literature  on  software 
development  and  SGML  reviewed  but  not  included  in  this 
research.  Other  software  products  (not  reviewed)  using  a 
parser  include  products  from  DocuPro,  Compugraphic,  Frame, 
Scribe (Smith, 1989b), and Xyvision.

Hypertext and Hypermedia Systems

Bush (1945) is generally credited with proposing the
initial principles on which current hypertext systems are
based. Hypertext systems allow one to link specific segments
of the current document, or the entire document, with other
on-line documents in a variety of sequences
(Newcomb et al., 1991). A true hypertext system should make
users feel that they can move freely through the information
based solely on their own needs (Nielson, 1990). Hypermedia
systems  create  multi-media  (e.g.,  text,  graphics,  sound,  and 
executable  programs)  documents  with  hyperlinks  (Newcomb  et 
al.,  1991). 

Several first generation (pre-1980s) hypermedia systems
include NLS/Augment (Englebart and English, 1968) (Englebart
et al., 1973) (Englebart, 1984), FRESS (Meyrowitz, 1986),
Thumb (Price, 1982), and ZOG (McCracken and Akscyn, 1984)
(Robertson et al., 1981). Conklin (1987) provides a review of
these systems, which were primarily mainframe-based systems.
Another system from the late 1960s was HES (Carmody et al.,
1969) (van Dam, 1988).

The  second  generation  hypertext/hypermedia  systems  were 
research  oriented  systems  developed  for  use  on  workstations. 
These  included  KMS  (Akscyn  et  al.,  1988),  a  newer  version  of 
ZOG,  Neptune  (Delisle  and  Schwartz,  May  1986)  (Delisle  and 
Schwartz,  December  1986),  Intermedia  (Garrett  et  al.,  1986) 
(Meyrowitz, 1986), and NoteCards (Halasz, et al., 1987)
(Halasz,  1988)  (Trigg,  1988). 

The  next  generation  hypertext/hypermedia  systems  are 
currently  being  developed  for  use  on  personal  computers.  Some 
of  these  include  Guide  (Brown,  1987)  (Owl  International, 
1986), Hyperties (Shneiderman, 1987a and 1987b), and
HyperWriter!  (Ntergaid  Inc.,  1992). 

Another  investigation  of  hypertext  structure  and  content 
produced  Trellis  (Stotts  and  Furuta,  1988  and  1989) ,  which 
allowed  the  author  to  specify  browsing  semantics  along  with 
structure  and  content  as  parts  of  a  document.  Delisle  and 
Schwartz  (1987),  Zellweger  (1988),  and  Trigg  (1988)  provide 
other  work  as  a  basis  for  allowing  the  author/reader  to 
specify  traversal  paths  for  hypertext  documents. 

Products  have  been  developed  with  hypertext  capabilities 
based  on  SGML.   Smith  (1989b)  outlines  three  such  products: 

1)  Idex,  a  product  of  Office  Workstations  Limited, 
uses  hypertext  in  a  multi-user  environment. 

2)  Optical  disks  as  an  output  medium  have  been 
addressed  by  the  Optical  Publishing  Association. 

3)  The  Silversmith  product  of  Taunton  Engineering 
combines  both  hypertext  capabilities  and  optical 
disk  production. 

Some hypertext/hypermedia systems include those used by
Washington University's Manual of Medical Therapeutics
(Frisse, 1988) and The Oxford English Dictionary (Raymond and
Tompa, 1988); a system for page-oriented databases (Tompa,
1989); and a system for management of software life-cycle
documents (Garg and Scacchi, 1990).


CHAPTER III
PROCEDURES 


Procedures  For  Specific  Objective  Number  One 

The  first  objective  was  to  model  the  structure  of  a  set 
of  FCES  publications,  and  represent  the  content  of  the 
publications  in  an  application-independent  form. 

The  first  procedure  for  specific  objective  number  one  was 
to  do  a  document  analysis  on  a  sample  set  of  FCES 
publications.  Fifty  FCES  extension  fact  sheets  and  circulars 
made  up  the  sample  set  of  publications.  The  document  analysis 
identified  the  structural  properties  of  the  FCES  publications, 
including  the  identification,  naming,  hierarchical  order, 
location,  frequency,  and  interrelationships  between  elements. 

The second procedure for specific objective number one
was to develop a model (DTD) that provided a vocabulary
representation of the structural elements identified during
document analysis. International Standard ISO 8879-1986
(Standard Generalized Markup Language (SGML)) was used to
define how structural elements were placed in the model. The
DTD structure was compared with other industry standard DTDs
for compatibility. Currently, the Association of American
Publishers' DTDs (i.e., book, article, and serial DTDs) are
the only recognized American standards. Industry standard
DTDs were considered alternatives to in-house development of
models when structure was similar. Transferability and
portability of information in FCES publications were important
considerations during the model development and selection
process. XGML Validator (Exoterica Corporation, 1987), a
commercial software product, was used to ensure the DTD
conformed to the ISO 8879-1986 standard and SGML syntax. This
was the initial validation step.

Procedures  For  Specific  Objective  Number  Two 

The  second  specific  objective  was  to  verify  the  model  by 
automating  and  testing  a  process  of  using  FCES  publications  in 
SGML  form  for  electronic  storage  and  delivery. 

The first procedure for specific objective number two was
to generate electronic files of the sample FCES publications
based on the structure of the FCES model. Currently,
WordPerfect (WordPerfect Corporation, 1993a) is the FCES
standard for word processing software. FCES authors use
WordPerfect with an additional pop-up menu (Appendix F) to
apply both generic and specific styles as markup. The pop-up
menu and associated software tools (macros, styles, and soft
fonts) are known as FAST-WP, Florida's Authoring Tools for
WordPerfect. After development of the DTD in specific
objective number one, FAST-WP underwent extensive
modifications to reflect the structural elements in the FCES
model. FAST-WP was then used to tag the structural elements
in the sample set of FCES publications.

The second procedure for specific objective number two was
to convert the tagged FCES publications into SGML format
(ASCII tagged text files). In-house software (WP2SGML) created
an SGML instance (ASCII tagged text file) for each FCES
publication based on the model developed in specific objective
number one. XGML Validator software (Exoterica Corporation,
1987) parsed each FCES instance to verify that it conformed to
the FCES model.

The third procedure for specific objective number two was
to convert the fifty FCES instances into a format that
retrieval systems such as FAIRS, Guide (InfoAccess, Inc.,
1994), and Multimedia Viewer (Microsoft Corporation, 1994) can
understand. The conversion of instances to the format of
several retrieval systems allowed a subjective evaluation of
the model. Each structural element in the model received a
ranking for importance in on-screen display: extremely
important, somewhat important, or not needed for display. The
conversion process also determined whether the translation of
elements into each retrieval system format would be automatic
or manual. The generation of electronic databases from FCES
publications in SGML format (instances) was used to verify
that the DTD was application-independent and a useful model of
FCES publications.


CHAPTER  IV 
MODEL  DEVELOPMENT  PROCESS 


Currently,  no  other  college  institution  has  a  model  or 
model  development  process  that  FCES  can  review  for  developing 
application-independent  document  storage  for  publications. 
Appendix  B  describes  an  industry  process  that  uses  a 
generalized  markup  language  as  the  preparatory  step  for 
application-independent  document  storage  of  publications.  An 
SGML  tutorial  (Graphic  Communications  Association,  1991) 
provided  this  process  of  model  development  and  gave  some 
direction  for  FCES  development  of  SGML  applications. 

Determine  Sample  Set  of  FCES  Publications 

The  number  of  FCES  publications  chosen  as  a 
representative  sample  set  was  fifty.  The  documents  were 
selected  from  volumes  one  and  two  on  FAIRS  DISCS.  Only 
documents  created  after  June  1,  1993  were  selected  to  ensure 
that  most  publications  were  tagged  with  FAST-WP.  A  total  of 
eight  hundred  and  ninety-five  publications  fell  into  this 
category.  Each  filename  begins  with  two  letters  that  describe 
the  heading  under  which  the  publication  can  be  found.  The 
five  digit  number  following  the  two  letters  is  the  actual 
document number. The @RAND function in the Quattro Pro
spreadsheet program (Borland International, Inc., 1992) was
used to randomly select the fifty documents. Appendix C
provides the filenames and titles that were randomly
selected.
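
The selection step can be sketched in a few lines of Python.
This is an illustration only, not the procedure used in the
study, which relied on @RAND in Quattro Pro. The filenames,
the fixed seed, and the helper name select_sample are all
hypothetical.

```python
import random


def select_sample(filenames, sample_size=50, seed=1995):
    """Draw a simple random sample of publications without replacement."""
    # A fixed seed (an assumption, not part of the study) makes the draw repeatable.
    rng = random.Random(seed)
    return sorted(rng.sample(filenames, sample_size))


# Hypothetical stand-ins for the 895 eligible publications: two letters
# for the heading followed by a five-digit document number.
eligible = [f"AE{n:05d}" for n in range(1, 896)]
sample = select_sample(eligible)
print(len(sample), sample[:3])
```

Sampling without replacement guarantees that no publication
is chosen twice, which a column of independent random numbers
does not guarantee by itself.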

Results  of  Document  Analysis  on  Selected  FCES  Publications 

A graduate level course was the initial setting for
structural analysis of FCES publications (Harrison et al.,
1992). The class consisted mostly of computer and
communication specialists. The class was divided into two
working groups, with each group having an identical subset of
FCES publications. Both groups did a document analysis on the
FCES publications and drew a tree diagram to represent the
structure. An attempt was made to develop one tree diagram
from the structures of both trees. Generally, each group
preferred its own description of the FCES publication
structure. Keeping both tree structures would result in the
development of two models to represent the same FCES
publications. Describing FCES publication structure in two
formats would prevent one group from interpreting the other's
information. The author's experience with the class showed
that structural specifications for specific informational
areas are open to varying interpretations. Usually, the
interpretation of structure in FCES publications was dependent
upon the person's level of experience, training, personal
preference, and knowledge of the publishing environment.

At this point the author did a more extensive document
analysis on the set of FCES publications. Structural areas in
each document were broken down to the textual (lowest) level,
and definitions were given for each area of information. Each
definition was based on a review of identical areas of
information in the set of FCES publications. Further
refinements included relating text back to its parent
structural elements, merging similar structural areas,
deleting redundant structural areas, and restructuring and
simplifying the structure. The first level of structure was
organized into front matter and body matter. These two main
elements allow all other informational elements to attach at
some level in the structure. The tree diagram and structural
definitions were the basis for the FCES model development.

Model  (DTD)  Development  and  Selection 

It is important at this phase to decide whether one
should use a custom model for private use or modify an
existing model in the marketplace. Modifying an existing
model (DTD) in the marketplace could increase access by the
public to information in FCES publications. The information
in FCES publications can be accessed and used by those systems
that can read and interpret the existing model. Removal of
these modifications allows the information to reflect the
original model if needed. A review of literature found that
the Association of American Publishers (1987) models are the
only models recognized as an American National Standards
Institute (ANSI) standard. The AAP models are entitled Book,
Article, and Serial. A comparative analysis of the AAP models
with the FCES tree diagram and structural definitions was
initiated. Upon review, the AAP Article DTD most closely
resembled the structural properties and definitions of
information in the FCES publications. Appendix D lists the
tree diagrams that represent FCES publication structure using
the AAP Article model and some unique elements. Figure 4-1
shows the current FCES model.

The first set of elements unique to the FCES publication
model were hyperlink (hyp), link word (lword) and action
(act). The AAP Article model does not provide a way to
reference a hyperlink in publications. For example, "Figure 1"
in the text of an FCES publication can be tagged as the link
word (lword) of a hyperlink, on which the retrieval software
can perform some action (act). Two attributes were defined for
the element act to describe the type of action to be performed
on each hyperlink. The first attribute was named "type." The
attribute values of "type" could be text, a graphic, an
executable program, audio, a table or a data record. The
second attribute for element act was named "descr." This
attribute stores in an FCES instance the information that was
placed in a comment box for each hyperlink. The descr
attribute has a value of character data (CDATA). Hyperlink
elements were placed as an inclusion within the content model
of the article element. This allows a hyperlink to occur
anywhere within an FCES publication.
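
A short Python sketch shows what a tagged hyperlink might look
like in an FCES instance. The element and attribute names come
from the FCES model in Figure 4-1; the helper name
make_hyperlink, the sample file name, and the assumption that
the content of act names the link target are hypothetical.

```python
# Values allowed for the "type" attribute of act in the FCES model.
ACTION_TYPES = {"text", "bitmap", "exeprgm", "audio", "table", "datarec"}


def make_hyperlink(link_word, action, action_type, descr):
    """Return SGML markup for one hyp element (lword followed by act)."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type}")
    # Protect the attribute literal: escape "&" first, then the quote
    # character (as a numeric character reference).
    safe_descr = descr.replace("&", "&amp;").replace('"', "&#34;")
    return (
        f"<hyp><lword>{link_word}</lword>"
        f'<act type="{action_type}" descr="{safe_descr}">{action}</act></hyp>'
    )


# Hypothetical example: the words "Figure 1" link to a bitmap file.
markup = make_hyperlink("Figure 1", "fig1.bmp", "bitmap", "Soil profile")
print(markup)
```

Because hyp is an inclusion on the article element, markup of
this form could appear at any depth in the instance.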

The second set of elements unique to the FCES publication
model were the rectangular coordinates for simple and complex
tables (tdim). The elements provided a way to describe the
row height (rowhgt) and column width (colwid) for each
row/column combination in simple and complex tables. Their
main purpose was to simplify computation of table dimensions.
The row heights and column widths were listed as
comma-delimited numbers in 1200ths of an inch, and described
all columns and rows in a table.
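
As an illustration of how retrieval software might use these
values, the following Python sketch converts the
comma-delimited content of colwid and rowhgt into inches and
totals them. The sample numbers and the function name
parse_dimensions are hypothetical.

```python
UNITS_PER_INCH = 1200  # tdim values are stored in 1200ths of an inch


def parse_dimensions(pcdata):
    """Convert a comma-delimited list of 1200ths of an inch to inches."""
    return [
        int(field) / UNITS_PER_INCH
        for field in pcdata.split(",")
        if field.strip()  # ignore empty fields such as a trailing comma
    ]


# Hypothetical content of <colwid> and <rowhgt> for a three-column,
# two-row table.
colwid = parse_dimensions("2400,1800,1800")
rowhgt = parse_dimensions("300, 300")
table_width = sum(colwid)
table_height = sum(rowhgt)
print(colwid, table_width, table_height)
```

Storing whole numbers in a fine unit avoids decimal fractions
in the instance while still letting the display software
recover inches with a single division.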

Five attributes were added to complex table header (cth),
simple table cells (c) and complex table cells (cte) to aid
retrieval software in displaying tables on-screen. The first
attribute (shaded) described whether the element was shaded
or not (y/n). The remaining attributes were named for the top
line (topline), bottom line (botline), left line (lftline) and
right line (rgtline) that surround each of the three
elements. The four attributes were given the same declared
value, NAME. The name defined whether each line surrounding
the three elements was (n)one, (s)ingle, (d)ouble, d(a)shed,
d(o)tted, (t)hick or (e)xtra thick. Appendix E defines the
elements, attributes and entities in the FCES model.
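
The one-letter line codes lend themselves to a simple lookup.
The Python sketch below is only an illustration of how display
software might expand the attributes on a cell; the function
name describe_cell and the sample attribute values are
hypothetical.

```python
# One-letter codes for the NAME value of the four line attributes.
LINE_STYLES = {
    "n": "none", "s": "single", "d": "double", "a": "dashed",
    "o": "dotted", "t": "thick", "e": "extra thick",
}
SIDES = ("topline", "botline", "lftline", "rgtline")


def describe_cell(attrs):
    """Expand the attribute codes found on a c, cth, or cte start tag."""
    borders = {side: LINE_STYLES[attrs[side]] for side in SIDES}
    borders["shaded"] = attrs["shaded"] == "y"
    return borders


# Hypothetical cell: double rule above, single rule below, no side rules.
cell = describe_cell({"shaded": "n", "topline": "d", "botline": "s",
                      "lftline": "n", "rgtline": "n"})
print(cell)
```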

<!DOCTYPE article [
<!ELEMENT article         - - (fm, bdy) +(fig|fn|hyp)>
<!ELEMENT fm              - - (tig, au*, pubfm?, abs*)>
<!ELEMENT tig             - - (atl)>
<!ELEMENT atl             - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT (it|b|e1|e2|e3) - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT fgr             - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT f               - - (sup|inf)>
<!ELEMENT (sup|inf)       - - (#PCDATA)>
<!ELEMENT au              - - (snm)>
<!ELEMENT (onm|snm)       - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT pubfm           - - ((crt|avl)|(aid|issn))*>
<!ELEMENT (aid|issn)      - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT avl             - - (onm)>
<!ELEMENT crt             - - (crd)>
<!ELEMENT crd             - - (mo?, day?, yr)>
<!ELEMENT abs             - - (h?, p, (p|(tbl|ctbl)|(l1|l2|l3)|f)*)>
<!ELEMENT (mo|day|yr)     - - (#PCDATA)>
<!ELEMENT h               - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT bdy             - - (sec)+>
<!ELEMENT sec             - - (st, (p|(tbl|ctbl)|(l1|l2|l3)|f|top1)*, ss1*)>
<!ELEMENT ss1             - - (st, (p|(tbl|ctbl)|(l1|l2|l3)|f|top1)*, ss2*)>
<!ELEMENT ss2             - - (st, (p|(tbl|ctbl)|(l1|l2|l3)|f|top1)*, ss3*)>
<!ELEMENT ss3             - - (st, (p|(tbl|ctbl)|(l1|l2|l3)|f|top1)*)>
<!ELEMENT st              - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT p               - - (#PCDATA|((it|b|e1|e2|e3)|fgr)|(tbl|ctbl)|(l1|l2|l3)|f)*>
<!ELEMENT top1            - - (h?, p, (p|(tbl|ctbl)|(l1|l2|l3)|f)*)>
<!ELEMENT (l1|l2|l3)      - - (lh?, li*)>
<!ELEMENT lh              - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT li              - - (p, (p|(tbl|ctbl)|(l1|l2|l3)|f)*)>
<!ELEMENT fig             - - EMPTY>
<!ELEMENT fn              - - (p, (p|(tbl|ctbl)|(l1|l2|l3)|f)*) -(fig|fn)>
<!ELEMENT hyp             - - (lword, act) -(fig|fn|hyp)>
<!ELEMENT lword           - - (#PCDATA)>
<!ELEMENT act             - - (#PCDATA)>
<!ATTLIST act
          type     (text|bitmap|exeprgm|audio|table|datarec) #REQUIRED
          descr    CDATA     #REQUIRED>
<!ELEMENT tbl             - - (no?, tt, tdim?, tby) -(fig|fn|tbl|ctbl)>
<!ELEMENT no              - - (#PCDATA)>
<!ELEMENT tt              - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT tby             - - (th*, tsh*, row*)>
<!ELEMENT row             - - (tsb?, c*)>
<!ELEMENT (th|tsh)        - - (#PCDATA|((it|b|e1|e2|e3)|f|fgr))*>
<!ELEMENT (tsb|c)         - - (p, (p|(l1|l2|l3)|f)*)>
<!ATTLIST c
          shaded   (y|n)     #REQUIRED
          topline  NAME      #REQUIRED
          botline  NAME      #REQUIRED
          lftline  NAME      #REQUIRED
          rgtline  NAME      #REQUIRED>
<!-- The NAME defines whether there is (n)o line, (s)ingle, (d)ouble, -->
<!-- d(a)shed, d(o)tted, (t)hick, or (e)xtra-thick line at the top, left, -->
<!-- right, and bottom of each cell -->
<!ELEMENT ctbl            - - (cthd, ctby, ctbf) -(fig|fn|tbl|ctbl)>
<!ELEMENT cthd            - - (no?, ctt?, tdim?, cthr*)>
<!ELEMENT ctt             - - ((it|b|e1|e2|e3)|f|#PCDATA)*>
<!ELEMENT tdim            - - (colwid, rowhgt)>
<!ELEMENT (colwid|rowhgt) - - (#PCDATA)>
<!ELEMENT cthr            - - (cth*)>
<!ELEMENT (cte|ctc)       - - (p|(l1|l2|l3)|f)*>
<!ELEMENT (cth|ctsbl)     - - (p|(l1|l2|l3)|f)*>
<!ATTLIST cth
          align    (l|c|r|d) #IMPLIED
          valign   (t|m|b)   #IMPLIED
          cb       NUMBER    #IMPLIED
          ce       NUMBER    #IMPLIED
          rb       NUMBER    #IMPLIED
          re       NUMBER    #IMPLIED
          shaded   (y|n)     #REQUIRED
          topline  NAME      #REQUIRED
          botline  NAME      #REQUIRED
          lftline  NAME      #REQUIRED
          rgtline  NAME      #REQUIRED>
<!-- The NAME defines whether there is (n)o line, (s)ingle, (d)ouble, -->
<!-- d(a)shed, d(o)tted, (t)hick, or (e)xtra-thick line at the top, left, -->
<!-- right, and bottom of each cell -->
<!ATTLIST ctsbl
          align    (l|c|r|d) #IMPLIED
          valign   (t|m|b)   #IMPLIED>
<!ATTLIST cte
          align    (l|c|r|d) #IMPLIED
          valign   (t|m|b)   #IMPLIED
          cb       NUMBER    #IMPLIED
          ce       NUMBER    #IMPLIED
          rb       NUMBER    #IMPLIED
          re       NUMBER    #IMPLIED
          shaded   (y|n)     #REQUIRED
          topline  NAME      #REQUIRED
          botline  NAME      #REQUIRED
          lftline  NAME      #REQUIRED
          rgtline  NAME      #REQUIRED>
<!-- The NAME defines whether there is (n)o line, (s)ingle, (d)ouble, -->
<!-- d(a)shed, d(o)tted, (t)hick, or (e)xtra-thick line at the top, left, -->
<!-- right, and bottom of each cell -->
<!ELEMENT ctby            - - (ctr)*>
<!ELEMENT ctr             - - (ctsbl?, cte*)>
<!ELEMENT ctbf            - - (ctc)>
<!ENTITY amp    "&">
<!ENTITY lt     "<">
<!ENTITY rsqb   "]">
<!ENTITY aacute SDATA "[aacute]">
<!ENTITY Aacute SDATA "[Aacute]">
<!ENTITY auml   SDATA "[auml]">
<!ENTITY Auml   SDATA "[Auml]">
<!ENTITY eacute SDATA "[eacute]">
<!ENTITY Eacute SDATA "[Eacute]">
<!ENTITY iacute SDATA "[iacute]">
<!ENTITY Iacute SDATA "[Iacute]">
<!ENTITY oacute SDATA "[oacute]">
<!ENTITY Oacute SDATA "[Oacute]">
<!ENTITY ouml   SDATA "[ouml]">
<!ENTITY Ouml   SDATA "[Ouml]">
<!ENTITY uacute SDATA "[uacute]">
<!ENTITY Uacute SDATA "[Uacute]">
<!ENTITY uuml   SDATA "[uuml]">
<!ENTITY Uuml   SDATA "[Uuml]">
<!ENTITY bull   SDATA "[bull]">
<!ENTITY squf   SDATA "[squf]">
<!ENTITY diams  SDATA "[diams]">
<!ENTITY cent   SDATA "[cent]">
<!ENTITY check  SDATA "[check]">
<!ENTITY copy   SDATA "[copy]">
<!ENTITY ndash  SDATA "[ndash]">
<!ENTITY mdash  SDATA "[mdash]">
<!ENTITY deg    SDATA "[deg]">
<!ENTITY prime  SDATA "[prime]">
<!ENTITY Prime  SDATA "[Prime]">
<!ENTITY female SDATA "[female]">
<!ENTITY agr    SDATA "[agr]">
<!ENTITY bgr    SDATA "[bgr]">
<!ENTITY dgr    SDATA "[dgr]">
<!ENTITY Dgr    SDATA "[Dgr]">
<!ENTITY mgr    SDATA "[mgr]">
<!ENTITY sgr    SDATA "[sgr]">
<!ENTITY Sgr    SDATA "[Sgr]">
<!ENTITY male   SDATA "[male]">
<!ENTITY minus  SDATA "[minus]">
<!ENTITY times  SDATA "[times]">
<!ENTITY divide SDATA "[divide]">
<!ENTITY plusmn SDATA "[plusmn]">
<!ENTITY le     SDATA "[le]">
<!ENTITY ge     SDATA "[ge]">
<!ENTITY ne     SDATA "[ne]">
<!ENTITY frac12 SDATA "[frac12]">
<!ENTITY frac13 SDATA "[frac13]">
<!ENTITY frac14 SDATA "[frac14]">
<!ENTITY frac23 SDATA "[frac23]">
<!ENTITY frac34 SDATA "[frac34]">
<!ENTITY Ntilde SDATA "[Ntilde]">
<!ENTITY ntilde SDATA "[ntilde]">
<!ENTITY ldquo  SDATA "[ldquo]">
<!ENTITY rdquo  SDATA "[rdquo]">
<!ENTITY reg    SDATA "[reg]">
<!ENTITY trade  SDATA "[trade]">
<!ENTITY iquest SDATA "[iquest]">
<!ENTITY iexcl  SDATA "[iexcl]">
<!ENTITY laquo  SDATA "[laquo]">
<!ENTITY raquo  SDATA "[raquo]">
]>

Figure 4-1. The FCES Model.

The AAP Article model provides opportunities to represent
multiple elements in the FCES publication model. One such
opportunity is the provision in the AAP Article model for up
to five different types of lists. The FCES model requires
three of them: bulleted (l1), unmarked (l2) and enumerated
(l3) lists. The AAP model also provides for three types of
emphasis (i.e., e1, e2 and e3) other than elements such as
bold (b) and italics (it). Currently the FCES model includes
e1 (not currently used), e2 (scientific name) and e3
(reference title) as emphasis types. The superscript and
subscript elements were not initially found in the main part
of the AAP Article model, but are required in the FCES model.
The two elements are in the AAP Math model, which is a part of
the AAP Article model. A geometric formula (f) consists of
superscript and subscript elements in the AAP Math model.
Adding the geometric formula element to the FCES model in the
structural locations specified for the AAP Math model allowed
the addition of superscript and subscript. Table 4-1 provides
a line-by-line explanation of the FCES model, including the
new additions.


[Table 4-1. Line-by-line explanation of the FCES content
model. Column headings: Line, Model, Explanation of Content
Model.]

With  the  exception  of  hyperlink  elements  and  column  and 
row  elements  in  simple  and  complex  tables,  the  model  is  a 
subset  of  the  AAP  Article  model.  Having  the  FCES  DTD  as  a 
subset  of  the  AAP  Article  model  provides  three  benefits  for 
FCES  publications.  First,  little  customization  is  needed  to 
convert  the  information  in  FCES  publications  to  ANSI  standard 
format.  Second,  the  information  in  FCES  publications  has  the 
benefit  of  terminology  and  names  that  are  widely  used  in  the 
United  States.  Third,  the  FCES  model  provides  potential 
portability  of  FCES  publications  to  systems  that  read 
information  in  AAP  Article  model  format. 

The  basis  for  the  design  and  development  of  the  FCES 
model  was  to  allow  information  in  the  FCES  publications  to  be 
described  in  a  form  independent  of  any  computer  hardware  and 
software.  A  process  must  be  established  that  verifies  the 
model  is  an  application-independent  storage  of  FCES 
publications. 


CHAPTER  V 
MODEL  VERIFICATION 


FCES  Publication  Preparation  and  Conversion  to  SGML  Format 

Identification  of  Model  Elements  in  FCES  Publications 

An electronic toolkit, known as FAST-WP, was developed at
the University of Florida to make it easy for authors and word
processors to add special codes to WordPerfect 5.1
(WordPerfect Corporation, 1993a) files running under DOS
(Cilley and Watson, 1992a and 1992b). The special codes,
WordPerfect styles with generic stylenames, were used to
define structural areas within FCES publications. The styles
were placed in FCES publications via a pop-up menu (Appendix
F). FAST-WP was modified after the FCES DTD was developed to
include generic styles that reflect the structural elements in
the model. The author then used FAST-WP to apply generic
styles to the subset of FCES publications. During this
tagging process FAST-WP was tested, edited and updated as
problems were encountered. Appendix G provides a table
describing the relationships between the model elements and
their generic styles that were placed in FCES publications.

Conversion of FCES Publications Into SGML Format

A computer program, WP2SGML (WordPerfect-to-SGML), was
written to generate SGML instances from the FCES publications
(Harrison et al., 1992). Currently, WP2SGML converts an FCES
publication into an ASCII file with start and end tags that
describe its structure. WP2SGML was written in C++, using
object-oriented features of the language. Each style has its
own object for conversion from WordPerfect to SGML. The
objects use inheritance, so they can be processed at a
general or specific level. Instances (tagged FCES documents)
generated by WP2SGML were validated for conformance to the
FCES document model (DTD) using Exoterica's XGML Validator
software (Version 1.0, 1991). Initially, some documents did
not conform to the FCES model due to errors in the logic of
WP2SGML, structural problems in the FCES model, or tagging
errors by the author. The FCES DTD was edited, tested,
evaluated, and updated whenever structural differences were
found in the tagged FCES documents.
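
The idea of one converter object per style, with shared
behavior inherited from a general class, can be sketched as
follows. WP2SGML itself was written in C++ and its source is
not reproduced here; this Python sketch, including every class
and style name in it, is a hypothetical illustration of the
design rather than the actual program.

```python
class StyleConverter:
    """General level: wrap styled text in start and end tags."""

    tag = "p"

    def convert(self, text):
        return f"<{self.tag}>{text}</{self.tag}>"


class SectionTitleConverter(StyleConverter):
    """Specific level: only the tag differs, so convert() is inherited."""

    tag = "st"


class ListItemConverter(StyleConverter):
    """Specific level: the li content model requires a leading paragraph."""

    tag = "li"

    def convert(self, text):
        return f"<li><p>{text}</p></li>"


# Hypothetical style names mapped to their converter objects.
CONVERTERS = {
    "Heading": SectionTitleConverter(),
    "ListItem": ListItemConverter(),
}


def convert_styled_text(style_name, text):
    # Fall back to the general converter when no specific object exists.
    converter = CONVERTERS.get(style_name, StyleConverter())
    return converter.convert(text)


print(convert_styled_text("Heading", "Introduction"))
```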

Many tags are placed with WordPerfect styles. Certain
WordPerfect structures, such as footnotes, figures, and
tables, are readily detected by the conversion program.
The conversion program uses the codes embedded in the
WordPerfect file to detect where, for example, a footnote
begins and ends, and where it should be placed. It then
places the footnote, along with start and end tags, into the
FCES instance (ASCII tagged text file). An example of a more
difficult conversion is the paragraph, where the program
must recognize hard returns to correctly detect
the beginning and ending of a paragraph. Once the beginning
and end of a paragraph are detected, the software places start
and end tags into the ASCII file. There are several
structural elements that are difficult to detect in a standard
WordPerfect document. These elements must be tagged with
styles for identification so the conversion program can detect
them. The pop-up menu is used to tag those structural
elements, such as heading levels and hyperlinks.
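
Paragraph detection from hard returns can be illustrated with
a simplified Python sketch. It assumes the document text has
already been extracted with each hard return represented by a
newline character; real WordPerfect files use binary codes,
and the function name tag_paragraphs is hypothetical.

```python
def tag_paragraphs(text, hard_return="\n"):
    """Wrap each run of text between hard returns in <p> start and end tags."""
    chunks = [chunk.strip() for chunk in text.split(hard_return)]
    # Consecutive hard returns produce empty chunks, which are skipped.
    return "\n".join(f"<p>{chunk}</p>" for chunk in chunks if chunk)


sample_text = "First paragraph.\n\nSecond paragraph.\n"
tagged = tag_paragraphs(sample_text)
print(tagged)
```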

Conversion  of  FCES  Instances  Into  Retrieval  Format 

Each element was ranked for importance in on-screen
display before the conversion of FCES instances into retrieval
system format. The author used the rankings as a starting
point for evaluating possible limitations a retrieval system
might have in delivering specific information in FCES
publications. A retrieval program could be deemed inadequate
when extremely important elements cannot be displayed
on-screen. The initial ranking also gave the author a way to
judge whether retrieval software can adequately display FCES
information on-screen via different methods (e.g., pop-up box
or full screen display of hyperlink material). The ranking
levels for the elements were extremely important (A), somewhat
important (B) or not needed for display (N). There were
twenty-two elements that were extremely important, thirty-two
somewhat important and fourteen that were not needed for
display.
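
The adequacy test described above can be expressed as a small
Python sketch. The ranking codes A, B, and N come from the
text; the particular elements, their rankings, and the
function name is_adequate are hypothetical examples, not the
rankings assigned in the study.

```python
def is_adequate(rankings, displayable):
    """Return (adequate, missing): a system is inadequate when any
    extremely important (A) element cannot be displayed on-screen."""
    missing = sorted(
        element
        for element, rank in rankings.items()
        if rank == "A" and element not in displayable
    )
    return (not missing, missing)


# Hypothetical rankings for four elements of the FCES model.
rankings = {"st": "A", "p": "A", "tt": "B", "aid": "N"}

# A hypothetical retrieval system that cannot display paragraphs.
adequate, missing = is_adequate(rankings, displayable={"st", "tt"})
print(adequate, missing)
```

Elements ranked B or N never make a system inadequate in this
sketch, which mirrors the rule that only extremely important
elements are decisive.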

Each  retrieval  system  requires  different  processes  to 
successfully  translate  FCES  instances  into  a  format  it 
understands.  A  brief  description  of  the  processes  for  each 
retrieval  system  is  as  follows. 

Candide: The Semantic Data Modelling Language For FAIRS DISC8
And DISC9

Query searches on tagged documents that are stored by
themselves in a database are cumbersome and inefficient (Beck
and Watson, 1992). Candide was designed specifically for
storing both the structure and content (semantics) of FCES
publications. Candide is a database management system that
has storage, retrieval and query facilities. The semantic
data model (Candide) is used to decompose each publication
into objects (Beck et al., 1989a).

The  objects  represent  the  meaning  of  words  and  can 
interact  with  each  other  to  represent  complex  data  (Beck  et 
al.,  1989b).  A  document  can  be  represented  using  these 
objects.  Candide  differs  from  other  semantic  data  models  by 
its  uniform  treatment  of  data  objects,  query  objects  and  view 
objects.  Candide  also  provides  for  query  searches  about  the 
structural  relationships  between  objects. 

FAIRS CD-ROM DISC8

FCES  instances  are  grouped  together  in  handbooks  or 
topics.  A  program  called  XTRAN  (Exoterica,  1990)  uses  a  set 
of  rules  to  convert  the  SGML  instances  into  an  object-oriented 
format similar to PROLOG. The structure of each file
(fname.out) determines its translation by CDM into the
Candide database (Beck et al., 1989a).

These Candide files are converted to ASCII files
(fname.asc) using DB2ASC, with each individual object or
chunk within a file treated as a single ASCII file. This
provides a set of ASCII chunks that link together through
the hyperlinks within the SGML instance. Editors then review
the chunks for formatting errors and prepare the ASCII files
(chunks) for on-screen display. Tables were a major problem
for both the editors and the translation. The editors had to
chunk all tables by hand because XTRAN did not have sufficient
rules to support tables. Chunking involves breaking a document
down into smaller pieces. Tables were extracted from the
WordPerfect file and saved as an ASCII file. Table extraction
allows tables and other chunks from an SGML instance to be
edited simultaneously. The final procedure (TXT2OBJ) is a
two-step process. The first step translates the ASCII files
(fname.asc) into text files; it removes all the hard returns
and places special codes at the end of paragraphs. The last
step converts the text files into objects, by retagging or
conversion. After this conversion, the objects are placed into
a retrievable database such as DISC8. The database includes
both data files and an index. The index file is a table
reference for each chunk of information on DISC8. DISC8
production relied on editors and chunking procedures for
on-screen display of documents. The SGML elements in the FCES
model were not used in the process.
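
The first TXT2OBJ step (removing hard returns and marking paragraph ends) might look roughly like the Python sketch below. It assumes that paragraphs in the ASCII chunk are separated by blank lines, and the `<EOP>` marker is a hypothetical stand-in, since the text does not give the actual special codes.

```python
import re

END_OF_PARAGRAPH = "<EOP>"  # hypothetical code; the real one is not documented here


def join_hard_returns(ascii_text, marker=END_OF_PARAGRAPH):
    """Remove hard returns inside paragraphs and mark each paragraph end.

    Paragraphs are assumed to be separated by one or more blank lines.
    The result holds one paragraph per line, each ending with `marker`.
    """
    paragraphs = re.split(r"\n\s*\n", ascii_text.strip())
    joined = (
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    )
    return "\n".join(p + marker for p in joined if p)
```

The second step, converting the text files into objects, depends on the database format and is not sketched here.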

A shell program (WP2DB) was written to create, script, and
run three batch files for placing WordPerfect files into a
Candide database. Each step in the process has its own
utility. A failure at any step stops the processing of that
document. The first batch file reads WordPerfect files, then
converts them into SGML format (fname.sgm) using WP2SGML.
Tables were extracted and saved in ASCII format before this
process. The shell then ensures each instance is a valid SGML
document. The second batch file reads the SGML instances and
creates the command line parameters for XTRAN. It runs XTRAN
and converts each instance into objects (objects.out). The
third batch file reads the fname.out file structure and
initializes the parameters in object format for CDM. CDM
translates the object files into Candide database format and
places them in the database. Figure 5-1 describes the
production process for DISC8.

[Figure 5-1 flowchart. Recoverable steps: ASCII files are
converted into text files; text files are converted into
objects; objects are translated and placed in the retrievable
database (TXT2OBJ).]

Figure 5-1. The production process for DISC8.
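
The control flow of a WP2DB-style shell, where each document passes through a fixed sequence of utilities and a failure at any step stops processing of that document only, can be sketched as follows. The step functions are empty stand-ins (assumptions), not the real WP2SGML, XTRAN, or CDM utilities, and the `.bad` file check is purely illustrative.

```python
def run_pipeline(documents, steps):
    """Run every document through `steps` in order.

    A failure at any step stops the processing of that document;
    the remaining documents are still processed.
    """
    results = {}
    for doc in documents:
        results[doc] = "ok"
        for name, step in steps:
            try:
                step(doc)
            except Exception as exc:
                results[doc] = f"stopped at {name}: {exc}"
                break
    return results


def validate(doc):
    # Stand-in for the SGML validity check run by the shell.
    if doc.endswith(".bad"):
        raise ValueError("not a valid SGML document")


# Placeholder steps named after the utilities in the text.
STEPS = [
    ("WP2SGML", lambda doc: None),
    ("validate", validate),
    ("XTRAN", lambda doc: None),
    ("CDM", lambda doc: None),
]
```

Calling `run_pipeline(["a.wp", "b.bad"], STEPS)` reports `a.wp` as processed and `b.bad` as stopped at the validation step.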


FAIRS    CD-ROM   DISC9 


The significant difference between the development procedures
for DISC8 and DISC9 was the use of the explicit structure in
the FCES instances (SGML elements). Also, a parser was
developed to make a one-step process out of the XTRAN and CDM
steps. The parser, or translator, is written in C++ and uses
a chart to translate elements into the Candide database. The
parser has one grammar rule for each SGML element. The
grammar rules embody the way the document being parsed is
analyzed. Each rule is paired with a template holding the
Candide object for that SGML element. The parser uses the
grammar rules to translate the input string and place the
actual data into the template, and the filled template
produces the data object that is put into the Candide
database. The rules and templates in the chart parser show the
elements and how they will appear in the Candide database. The
grammar rules from the chart parser are also stored in the
Candide database. Figure 5-2 describes the production process
for DISC9.

Figure 5-2. The production process for DISC9.
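
The pairing of one rule and one template per SGML element can be illustrated with a much-simplified Python sketch. This is not a chart parser and not the C++ translator described above: it handles only flat, non-nested elements, the element names `ti` and `p` are assumed for illustration, and plain dictionaries stand in for Candide objects, whose real format is not shown in this chapter.

```python
import re

# One entry per element: element name -> object template (illustrative).
TEMPLATES = {
    "ti": {"class": "title", "text": None},
    "p": {"class": "paragraph", "text": None},
}

# Matches a flat element such as <p>some text</p>.
ELEMENT = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def translate(instance, templates=TEMPLATES):
    """Fill a copy of the matching template with each element's data."""
    objects = []
    for name, content in ELEMENT.findall(instance):
        template = templates.get(name)
        if template is None:
            raise KeyError(f"no rule for element <{name}>")
        obj = dict(template)  # copy so the template itself stays unfilled
        obj["text"] = content.strip()
        objects.append(obj)
    return objects
```

Because every element in the model has its own entry, an instance that validates against the model can be translated without manual intervention, which is the property the DISC9 process relies on.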

Multimedia Viewer

Microsoft Multimedia Viewer (1994) is frame-based retrieval
software that reads files in Rich Text Format (RTF). All
files created from SGML instances must be saved as RTF files
(i.e., topic files). These topic files contain RTF
statements, which are specially formatted tags that specify a
particular type of formatting information. Multiple topic
files are grouped together in the Viewer project file. Other
elements in a topic file include destinations, control
symbols, and groups. Both font and color tables can be
developed to define topic information. The key point here is
that Multimedia Viewer provides the formatting, while the user
provides the structure. A WordPerfect macro (SGML2RTF) was
written to automate the process from SGML instance to
Multimedia Viewer format. The process is as follows:

1)  Put  all  instances  and  graphics  to  be  converted  into  a 
directory. 

2)  Select  that  directory  for  conversion. 

3) Read the "factshts.tag" file for elements to be considered.

4)  Set  global  parameters. 

5)  Retrieve  footnote  elements. 

6)  Copy  footnote  elements  to  utility  document. 

7)  Copy  all  sections  <sec>  to  utility  document. 

8) Create 1 frame per document level (section <sec>,
subsection level 1 <ss1>, subsection level 2 <ss2>, and
subsection level 3 <ss3>).

9) Format any elements (e.g., lists, topic paragraphs,
paragraphs) in each section.

10) Write out the RTF file.

11)  Generate  the  table  of  contents  entry  for  the  current 
instance. 

12)  Next  instance  goes  through  the  conversion  process. 

13)  The  macro  ends. 

14)  Use  Multimedia  Viewer  to  create  the  fname.mvb  file  from 
the  project  file,  graphic  files,  and  RTF  files. 
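
Step 8 of the list above, creating one frame per document level, can be sketched in Python as follows. The actual SGML2RTF tool was a WordPerfect macro; this sketch only shows the splitting logic, assumes that every section and subsection start-tag is written out in full, and leaves out the RTF output entirely.

```python
import re

START_TAG = re.compile(r"<(sec|ss1|ss2|ss3)>")
ANY_LEVEL_TAG = re.compile(r"</?(?:sec|ss1|ss2|ss3)>")


def split_into_frames(instance):
    """Return (level, text) pairs, one per section or subsection.

    Each frame runs from its start-tag to the next level start-tag
    (or to the end of the instance); level end-tags are dropped.
    """
    frames = []
    matches = list(START_TAG.finditer(instance))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(instance)
        body = ANY_LEVEL_TAG.sub("", instance[match.end():end]).strip()
        frames.append((match.group(1), body))
    return frames
```

Each returned pair would then be formatted (step 9) and written out as one topic in the RTF file (step 10).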

Guide 

Guide  (InfoAccess,  1994)  is  an  electronic  publishing 
system  that  is  designed  around  an  object-oriented  information 
model.  The  model  presents  information  as  a  series  of  linked 
objects  and  manages  relationships  between  them.  All  document 
components,  from  a  single  word  to  a  graphic,  can  be 
represented  as  an  object.  Once  each  object  is  defined  by  a 
command,  it  can  be  linked  to  other  objects.  Guide  provides 
for  live  or  hot  objects  to  be  activated  with  the  mouse  using 
reference,  expansion,  note,  and  command  buttons.  Guide 
provides  a  scripting  language  (LOGiiX)  to  write  definitions 
for command buttons. Guide differs from Multimedia Viewer in
that it provides the structure for on-screen display, while
the user specifies how it will be displayed (formatting). A
program was written in the C programming language to convert
the SGML instances into HML (hypertext media language) format.
The steps in the program are as follows:

1)  Read  and  parse  an  instance  (fname.sgm). 

2)  The  first  pass  through  each  instance  picks  up  information 
necessary  for  the  conversion,  as  well  as  determining  if 
any  tables  are  present. 

3)  Convert  any  elements  in  instance  to  HML. 

4)  Write  out  the  HML  file. 

After conversion to HML, the instances are then converted
into Guide format (fname.hml to fname.gui) as follows:

1)  Develop  style  file  for  formatting  purposes. 

2) Run Guide Writer on the instance converted to HML format
(fname.hml).

3) Develop style file for tables. Files with a TMF
extension are generated for each table.

4) Table Viewer software takes each fname.tmf (text) as
input and displays the table on-screen (runtime).
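
The two-pass structure of the SGML-to-HML converter can be sketched as below. The original program was written in C; this Python version is only an outline of the idea. The table tag `tbl` and the output tag names in any mapping passed in are hypothetical placeholders, since actual HML syntax is not reproduced in this chapter.

```python
import re

# Matches simple start- and end-tags without attributes, e.g. <p> or </p>.
TAG = re.compile(r"<(/?)(\w+)>")


def first_pass(instance, table_tags=("tbl",)):
    """Collect the element names used and note whether tables are present."""
    names = {name for closing, name in TAG.findall(instance) if not closing}
    return {"elements": names, "has_tables": any(t in names for t in table_tags)}


def second_pass(instance, mapping):
    """Rewrite each tag using `mapping`; unmapped tags pass through unchanged."""
    def replace(match):
        closing, name = match.groups()
        return f"<{closing}{mapping.get(name, name)}>"
    return TAG.sub(replace, instance)


def convert(instance, mapping, table_tags=("tbl",)):
    """Run both passes and return the converted text plus the pass-one info."""
    info = first_pass(instance, table_tags)
    return second_pass(instance, mapping), info
```

The information gathered in the first pass is what would let a converter decide, for example, whether table files need to be generated for the instance.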

Results  of  FCES  Instances  Converted  to  Retrieval  Format 

Converting  FCES  Instances  into  FAIRS  Retrieval  Format 

Table 5-1 provides the evaluation of converting FCES
instances into FAIRS retrieval format for DISC8 and DISC9.
The DISC8 process relied on editors and chunking procedures to
manually prepare the information for on-screen display. There
was no automation in the production process. The DISC9
process made use of the explicit structure in the FCES
instances. All the elements were automatically translated to
the FAIRS retrieval format because the chart translator
creates a rule and template with an object for each element in
the model. The objects were then translated and placed into
the FAIRS retrievable database. The translation of the
objects and rules directly into the retrievable database is an
automatic process for all elements in the FCES model. Table
5-2 shows that the translation of elements into DISC8 format
was a manual process, while the translation to DISC9 format
was a totally automated process.

Converting  FCES  Instances  into  Multimedia  Viewer  Format 

Table  5-1  provides  some  translation  comments  on  the 
conversion  of  FCES  instances  to  Multimedia  Viewer  format.  The 
process  from  FCES  instances  to  Multimedia  Viewer  format  was 
primarily  an  automated  process.  Table  5-2  shows  that 
superscript  and  subscript  were  the  only  elements  that 
Multimedia  Viewer  could  not  translate  to  retrieval  system 
format. 

Converting  FCES  Instances  into  Guide  Format 

Table  5-1  provides  some  translation  comments  on  the 
conversion  of  FCES  instances  to  Guide  format,  which  was  an 
automated  process.  Table  5-2  shows  that  Guide  could  translate 
all  elements  in  the  model  to  retrieval  system  format. 

Interpretation of Elements by Automated Retrieval Systems

Software  was  written  to  automate  the  translation  of  elements 
into  each  retrieval  system  format  (DISC9,  Multimedia  Viewer  and 
Guide)  for  on-screen  display.  All  of  the  extremely  important 
elements  (A)  were  translated  to  the  three  retrieval  systems.  All 
of the somewhat important elements (B) were translated to DISC9 and
Guide retrieval systems. The Multimedia Viewer retrieval system
cannot display superscript or subscript characters.

Possible Model Changes

Simplification  of  the  model  can  reduce  both  the  knowledge 
authors  need  to  use  FAST-WP  and  redundancy  of  elements  in  FCES 
instances.  There  are  many  elements  that  "ride"  along  in  multiple 
content  models  as  holders  for  other  elements.  Some  of  these 
elements might include front matter (fm), title group (tig),
figure reference (fgr), geometric formula (f), author (au),
publishers front matter group (pubfm), copyright notice (crt),
copyright notice date (crd), complex table head (cthd), row/column
dimensions for simple and complex tables (tdim), and complex table
header (cth). The content models for each of these elements could
be represented by themselves in the model.


CHAPTER  VI 
SUMMARY,  CONCLUSIONS,  AND  RECOMMENDATIONS 


Summary 

The  objective  of  this  research  was  to  model  the  structure 
and  represent  the  information  in  FCES  publications  in  an 
electronic  form  that  is  independent  of  any  specific  computer 
hardware  or  software.  The  initial  research  began  with  a 
graduate  course  in  SGML  principles  and  practices  that  provided 
a  starting  point  for  the  author's  document  analysis  on  FCES 
publications.  Document  analysis  proved  to  be  an  ongoing 
process  throughout  the  research.  The  class  also  illustrated 
the  difficulty  of  attaining  agreement  on  a  model  for  a  set  of 
FCES  publications  from  two  or  more  groups.  To  forestall 
possible  disagreements  on  FCES  document  structure,  the  author 
proposed  to  adopt  an  outside  standard  rather  than  develop  an 
in-house model. After review of current American publication
models, coupled with the author's document analysis, the
author selected the Association of American Publishers (AAP)
Article model as the best representation of the structure in
FCES publications. The author initially developed a subset of
the AAP Article model, and later added structural elements
unique to FCES publications (Appendix E). Removal of the
unique elements would produce a model compatible with the AAP
Article DTD.

An  in-house  publishing  tool  (FAST-WP)  was  based,  in  part, 
on  the  FCES  model  developed  by  the  author.  The  generic  styles 
in  FAST-WP  reflect  the  structure  of  the  FCES  model.  The 
author  used  FAST-WP  as  the  publishing  tool  to  tag  the  subset 
of  FCES  publications. 

Once the FCES publications were tagged, an in-house
program (WP2SGML) was developed to convert each publication
into SGML format (instance). The author helped develop the
structural  conversions  that  WP2SGML  uses  in  the  conversion 
process.  These  structural  conversions  were  based  on  the  FCES 
model.  The  author  then  used  a  parser  to  verify  that  each  FCES 
instance  conformed  to  the  ISO  8879-1986  standards  and  the  FCES 
model.  This  verification  proved  that  the  FCES  model  describes 
the  content  of  FCES  publications. 

After  conversion  of  FCES  tagged  publications  into 
instances,  the  author  ranked  the  priority  level  of  each 
element  in  the  FCES  model  for  on-screen  display.  The  author 
then  developed  how  the  elements  in  the  FCES  model  would  appear 
on-screen.  Software  was  then  written  to  convert  FCES 
instances  into  Multimedia  Viewer  and  Guide  retrieval  system 
format.  All  elements  in  the  FCES  model  were  automatically 
translated  to  FAIRS  DISC9  and  Guide  retrieval  system  format. 
All elements in the FCES model, except superscript and
subscript, were automatically translated to Multimedia Viewer
format. The automatic translation of FCES publications in
SGML format to FAIRS DISC9, Multimedia Viewer and Guide
retrieval formats verifies that the FCES model is
application-independent.

Conclusions/Findings

After  converting  a  set  of  FCES  publications  into  several 
retrieval  system  formats  the  following  conclusions  were  drawn: 

1) SGML provided a suitable set of rules and syntax for
modeling FCES publications.

2)  The  FCES  model  was  validated  based  on  ISO  standard  8879- 
1986. 

3)  The  model  provided  the  structure  for  automatic  conversion 
of  FCES  instances  into  retrieval  system  format. 

4)  The  model  was  verified  as  a  suitable  way  of  describing 
information  in  FCES  publications  by  the  automatic 
conversion  of  FCES  instances  into  retrieval  system 
format. 

Observations 

Other findings that were not part of the research project,
but were observed during the process, include the following:

1) Document analysis is an ongoing process that affects the
entire conversion process. It determines document
preparation time and the degree of automation in the
conversion process.

2) Prior to development of the DTD, decide who has the
authority to change the SGML model (DTD). Application-independent
document storage revolves around the DTD.
The  DTD  is  the  single  most  important  aspect  of  any  SGML 
project.  Previously  converted  publications  must  be 
revalidated  when  there  are  any  changes  in  the  model. 

3)  When  developing  the  FCES  model,  the  more  rigid  the 
structure  the  easier  the  implementation. 

4)  Use  an  existing  model  as  a  starting  point  to  describe  the 
structure  of  publications.  A  subset  of  an  existing  model 
provides  greater  access  to  the  information. 

5)  Eliminate  any  "rider"  elements  that  tag  along  with  other 
elements.  If  possible,  use  only  one  element  to  describe 
the  information. 

6) Avoid recursive content models to simplify automated
conversion from word processing files to SGML format
(instance).

7) The specific retrieval software used is not important.

8) A high priority when selecting a retrieval system should
be which features are required for interpreting
information in an instance.

9)  Additional  conversion  time  is  required  when  a  retrieval 
system  requires  FCES  instances  in  a  proprietary  SGML 
model  format. 

Recommendations

1)  Continue  document  analysis  of  FCES  publications,  refining 
structural  properties  to  decrease  the  author's  knowledge 
requirement  of  the  process. 

2)  Update  model  as  publications  change,  ensuring 
compatibility  between  current  and  older  FCES 
publications. 

3) Generate reports when there are any changes or revisions
in the model.

4)  Design  new  templates  to  aid  authors  in  the  tagging 
process.  An  example  is  a  document  template  that  has 
blanks  to  fill  in  the  front  matter  of  a  publication. 

5)  Develop  tools  to  automatically  hyperlink  references  in 
FCES  publications  for  tables  and  graphics. 

6)  Continue  reviewing  retrieval  software  as  new  applications 
become  available. 

7) Link multiple software tools, such as authoring, SGML
conversion, error checking and validation, and retrieval
software, into a seamless interface tool for the author.

8) Provide on-line documentation as a help facility for
the entire document conversion process.

9)  Support  a  wider  variety  of  word  processors. 

10)  Minimize  or  eliminate  as  much  of  the  manual  tagging 
process  as  possible. 


GLOSSARY 

The ISO 8879-1986 International Standard gives the following
definitions to certain names or titles described in this
research project:

abstract syntax (of SGML): Rules that define how markup is
added to the data of a document, without regard to the
specific characters used to represent the markup.

application: Text processing application.

attribute (of an element): A characteristic quality, other
than type or content.

attribute definition: A member of an attribute definition
list; it defines an attribute name, allowed values, and
default value.

attribute definition list: A set of one or more attribute
definitions defined by the attribute definition list parameter
of an attribute definition list declaration.

base document element: A document element whose document
type is the base document type.

base document type: The document type specified by the first
document type declaration in a prolog.

CDATA: Character data.

CDATA entity: Character data entity.

character: An atom of information with an individual
meaning, defined by a character repertoire.

comment: A portion of a markup declaration that contains
explanations or remarks intended to aid persons working with
the document.

concrete syntax (of SGML): A binding of the abstract syntax
to particular delimiter characters, quantities, markup
declaration names, etc.

conforming SGML application: An SGML application that
requires documents to be conforming SGML documents, and whose
documentation meets the requirements of this International
Standard.

conforming SGML document: An SGML document that complies with
all provisions of this International Standard.

content:  Characters  that  occur  between  the  start-tag  and 
end-tag  of  an  element  in  a  document  instance.  They  can  be 
interpreted  as  data,  proper  subelements,  included  subelements, 
other  markup,  or  a  mixture  of  them. 

NOTE  -  if  an  element  has  an  explicit  content  reference,  or  its 
declared  content  is  "EMPTY",  the  content  is  empty.  In  such 
cases,  the  application  itself  may  generate  data  and  process  it 
as  though  it  were  content  data. 

(content)  model:  Parameter  of  an  element  declaration  that 
specifies  the  model  group  and  exceptions  that  define  the 
allowed  content  of  the  element. 

core concrete syntax: A variant of the reference concrete
syntax that has no short reference delimiters.

data: The characters of a document that represent the
inherent information content; characters that are not
recognized as markup.

data  content:  The  portion  of  an  element's  content  that  is 
data  rather  than  markup  or  a  subelement. 

data tag: A string that conforms to the data tag pattern of an
open element. It serves both as the end-tag of the open
element and as character data in the element that contains it.

declaration: Markup declaration.

declaration subset: A delimited portion of a markup
declaration in which other declarations can occur.

NOTE - Declaration subsets occur only in document type, link
type, and marked section declarations.

delimiter  characters:  Character  class  that  consists  of  each 
SGML  character,  other  than  a  name  character  or  function 
character,  that  occurs  in  a  string  assigned  to  a  delimiter 
role  by  the  concrete  syntax. 

delimiter  set:  A  set  of  assignments  of  delimiter  strings  to 
the  abstract  syntax  delimiter  roles. 

delimiter (string): A character string assigned to a delimiter
role  by  the  concrete  syntax. 

descriptive markup: Markup that describes the structure and
other attributes of a document in a non-system-specific
manner, independently of any processing that may be performed
on it. In particular, it uses tags to express the element
structure.

document:  A  collection  of  information  that  is  processed  as  a 
unit.  A  document  is  classified  as  being  of  a  particular 
document  type. 

NOTE - In this International Standard, the term almost
invariably means (without loss of accuracy) an SGML document.

document character set: The character set used for all markup
in an SGML document, and initially (at least) for data.

NOTE - When a document is interchanged between systems, its
character set is translated to the receiving system character
set.

document element: The element that is the outermost element
of an instance of a document type; that is, the element whose
generic identifier is the document type name.

document instance: Instance of a document type.

document type: A class of documents having similar
characteristics; for example, journal, article, technical
manual, or memo.

(document) type declaration: A markup declaration that
contains the formal specification of a document type
definition.

document (type) definition: Rules, determined by an
application, that apply SGML to the markup of documents of a
particular type. A document type definition includes a formal
specification, expressed in a document type declaration, of
the element types, element relationships and attributes, and
references that can be represented by markup. It thereby
defines the vocabulary of the markup for which SGML defines
the syntax.

NOTE - A document type definition can also include comments
that describe the semantics of elements and attributes, and
any application conventions.

DTD: Document type definition.

element:  A  component  of  the  hierarchical  structure  defined  by 
a  document  type  definition.  It  is  identified  in  a  document 
instance  by  descriptive  markup,  usually  a  start-tag  and 
end-tag. 

NOTE  -  An  element  is  classified  as  being  of  a  particular 
element  type. 

element  declaration:  A  markup  declaration  that  contains 
the  formal  specification  of  the  part  of  an  element  type 
definition  that  deals  with  the  content  and  markup 
minimization. 

element  structure:  The  organization  of  a  document  into 
hierarchies  of  elements,  with  each  hierarchy  conforming  to  a 
different  document  type  definition. 

element  type:  A  class  of  elements  having  similar 
characteristics;  for  example,  paragraph,  chapter,  abstract, 
footnote,  or  bibliography. 

entity:  A  collection  of  characters  that  can  be  referenced  as 
a  unit. 

NOTES

1)  Objects  such  as  book  chapters  written  by  different 
authors,  pi  characters,  or  photographs,  are  often  best  managed 
by  maintaining  them  as  individual  entities. 

2)  The  physical  organization  of  entities  is  system-specific, 
and  could  take  the  form  of  files,  members  of  a  partitioned 
data  set,  components  of  a  data  structure,  or  entries  in  a 
symbol  table. 

entity declaration: A markup declaration that assigns an SGML
name to an entity so that it can be referenced.

entity reference: A reference that is replaced by an entity.

NOTE - There are two kinds: named entity and short reference.

entity set: A set of entity declarations that are used
together.

NOTE - An entity set can be public text.

exclusions: Elements that are not allowed anywhere in the
content of an element or its subelements even though the
applicable content model or inclusions would permit them
optionally.

general entity: An entity that can be referenced from
within the content of an element or an attribute value
literal.

generic identifier: A name that identifies the element type of
an element.

GI: Generic identifier.

ID: Unique identifier.

inclusions: Elements that are allowed anywhere in the
content of an element or its subelements even though the
applicable model does not permit them.

instance (of a document type): The data and markup for a
hierarchy  of  elements  that  conforms  to  a  document  type 
definition. 

mark  up:   To  add  markup  to  a  document. 

markup:  Text  that  is  added  to  the  data  of  a  document  in  order 
to  convey  information  about  it. 

NOTE - There are four kinds of markup: descriptive markup
(tags), references, markup declarations, and processing
instructions.

(markup)  declaration:  Markup  that  controls  how  other  markup 
of  a  document  is  to  be  interpreted. 

NOTE  -  There  are  13  kinds:  SGML,  entity,  element,  attribute 
definition  list,  notation,  document  type,  link  type,  link  set, 
link  use,  marked  section,  short  reference  mapping,  short 
reference  use,  and  comment. 

(markup)  minimization  feature:  A  feature  of  SGML  that 
allows  markup  to  be  minimized  by  shortening  or  omitting  tags, 
or  shortening  entity  references. 

NOTE  -  Markup  minimization  features  do  not  affect  the  document 
type  definition,  so  a  minimized  document  can  be  sent  to  a 
system  that  does  not  support  these  features  by  first  restoring 
the  omitted  markup.  There  are  five  kinds:  SHORTTAG,  OMITTAG, 
SHORTREF,  DATATAG,  and  RANK. 

minimization feature: Markup minimization feature.

model: Content model.

model  group:  A  component  of  a  content  model  that  specifies 
the  order  of  occurrence  of  elements  and  character  strings  in 
an  element's  content,  as  modified  by  exceptions  specified  in 
the  content  model  of  the  element  and  in  the  content  models  of 
other  open  elements. 

name:  A  name  token  whose  first  character  is  a  name  start 
character. 

name  character:  A  character  that  can  occur  in  a  name:  name 
start  characters,  digits,  and  others  designated  by  the 
concrete  syntax. 

name group: A group whose tokens are required to be names.

number: A name token consisting solely of digits.

parameter entity: An entity that can be referenced from a
markup declaration parameter.

parameter  entity  reference:  A  named  entity  reference  to  a 
parameter  entity. 

parsed  character  data:    Zero  or  more  characters  that  occur  in 
a  context  in  which  text  is  parsed  and  markup  is  recognized. 
They  are  classified  as  data  characters  because  they  were  not 
recognized  as  markup  during  parsing. 

PCDATA: Parsed character data.

reference:  Markup  that  is  replaced  by  other  text,  either 
an  entity  or  a  single  character. 

reference concrete syntax: A concrete syntax, defined in
this International Standard, that is used in all SGML
declarations.

SGML: Standard Generalized Markup Language.

SGML application: Rules that apply SGML to a text processing
application. An SGML application includes a formal
specification of the markup constructs used in the
application, expressed in SGML. It can also include a non-SGML
definition of semantics, application conventions, and/or
processing.

NOTES

1)  The  formal  specification  of  an  SGML  application  normally 
includes  document  type  definitions,  data  content  notations, 
and  entity  sets,  and  possibly  a  concrete  syntax  or  capacity 
set.  If  processing  is  defined  by  the  application,  the  formal 
specification  could  also  include  link  process  definitions. 

2)  The  formal  specification  of  an  SGML  application 
constitutes  the  common  portions  of  the  documents  processed  by 
the  application.  These  common  portions  are  frequently  made 
available  as  public  text. 

3)  The  formal  specification  is  usually  accompanied  by 
comments  and/or  documentation  that  explains  the  semantics, 
application  conventions,  and  processing  specifications  of  the 
application. 

4) An SGML application exists independently of any
implementation. However, if processing is defined by the
application, the non-SGML definition could include application
procedures, implemented in a programming or text processing
language.

SGML  character:  A  character  that  is  permitted  in  an  SGML 
entity. 

SGML  declaration:  A  markup  declaration  that  specifies  the 
character  set,  concrete  syntax,  optional  features,  and 
capacity  requirements  of  a  document's  markup.  It  applies  to 
all  of  the  SGML  entities  of  a  document. 

SGML  document:  A  document  that  is  represented  as  a  sequence  of 
characters,  organized  physically  into  an  entity  structure  and 
logically  into  an  element  structure,  essentially  as  described 
in  this  International  Standard.  An  SGML  document  consists  of 
data  characters,  which  represent  its  information  content,  and 
markup  characters,  which  represent  the  structure  of  the  data 
and  other  information  useful  for  processing  it.  In  particular, 
the  markup  describes  at  least  one  document  type  definition, 
and an instance of a structure conforming to the definition.

SGML entity: An entity whose characters are interpreted as
markup or data in accordance with this International Standard.

NOTE - There are three types of SGML entity: SGML document
entity, SGML subdocument entity, and SGML text entity.

SGML parser: A program (or portion of a program or a
combination of programs) that recognizes markup in conforming
SGML documents.

NOTE - If an analogy were to be drawn to programming language
processors, an SGML parser would be said to perform the
functions of both a lexical analyzer and a parser with respect
to SGML documents.

Standard Generalized Markup Language: A language for document
representation that formalizes markup and frees it of system
and processing dependencies.

start-tag: Descriptive markup that identifies the start of
an element and specifies its generic identifier and
attributes.

tag: Descriptive markup.

NOTE - There are two kinds: start-tag and end-tag.

text: Characters.

NOTE  -  The  characters  could  have  their  normal  character  set 
meaning,  or  they  could  be  interpreted  in  accordance  with  a 
data  content  notation  as  the  representation  of  graphics, 
images,  etc. 

validating  SGML  parser:  A  conforming  SGML  parser  that  can 
find  and  report  a  reportable  markup  error  if  (and  only  if)  one 
exists. 


APPENDIX  A 
DEVELOPMENT  OF  AN  SGML  MODEL 


This appendix describes the process for developing an
SGML application. Given a subset of a class of documents, such
as fact sheets (Figure A-1), the documents are broken down
into their component pieces (Figure A-2). A tree structure
(Figure A-3) is then developed to describe the structure of
the set of documents. Upon completion of the document
analysis, the model (DTD) (Figure A-4), a vocabulary
representation of the document structure, is written and
validated for conformance to the ISO 8879-1986 standard and
SGML syntax. Tagged documents (Figure A-5), known as
instances, are then validated to ensure conformance (i.e., no
errors) with the model (DTD).


ANGELICA
James M. Stephens

INTRODUCTION

Angelica is a European perennial plant sometimes grown in
this country as a culinary herb. This member of the parsley family,
related to carrots, grows in fields and damp places from Labrador
to Delaware and west to Minnesota. Syria is believed to be its
point of origin. One species (A. sylvestris) is called "Holy Ghost."

Use

The fresh stems and leafstalks are used as garnish and for
making candied angelica. The seeds and the oil distilled from them
are used in flavoring foods, and the aromatic roots are used in
medicine. People in the north, particularly the Lapps, use it as a
foodstuff, condiment, or medicine, and even chew it like tobacco.

Description

The robust growing angelica plant is 5-6 feet tall and
resembles wild carrot, although the leaves are much broader. It has
large petioles and a purple-colored root. Leaves are compound and
flowers are borne in umbels like the carrot. It is a perennial plant
that flowers every 2 years.

Culture

The plant thrives best in a moderately cool climate in semishade;
therefore, it is unlikely to grow well in Florida. The plant is most
readily propagated from division of old roots, which can be set
either in the fall or spring about 18 inches apart in 3-foot rows. If
seeds can be obtained, start seedlings in a seed bed. Then
transplant to the garden.

Harvesting. Roots, stems, and seeds are harvested and used as
needed, with some parts being ready 3-4 months after planting.
Sometimes the roots of the first year plants are dug, but usually the
harvest of roots is deferred until fall of the second year.


Figure A-1. An FCES Fact Sheet.


Figure A-2. Document Analysis of FCES Fact Sheet.


<!DOCTYPE article
[
<!ELEMENT article  - -  (fm, bdy)>
<!ELEMENT fm       - -  (tig, au*)>
<!ELEMENT tig      - -  (atl)>
<!ELEMENT atl      - -  (#PCDATA)>
<!ELEMENT au       - -  (snm)>
<!ELEMENT snm      - -  (#PCDATA)>
<!ELEMENT bdy      - -  (sec)+>
<!ELEMENT sec      - -  (st, (p|top1)*, ss1*)>
<!ELEMENT ss1      - -  (st, (p|top1)*, ss2*)>
<!ELEMENT ss2      - -  (st, (p|top1)*, ss3*)>
<!ELEMENT ss3      - -  (st, (p|top1)*)>
<!ELEMENT top1     - -  (h?, p+)>
<!ELEMENT h        - -  (#PCDATA)>
<!ELEMENT st       - -  (#PCDATA)>
<!ELEMENT p        - -  (#PCDATA)>
]>

Figure A-4. DTD of FCES Fact Sheet.


<article><fm>
<tig><atl>ANGELICA</atl></tig>
<au><snm>James M. Stephens</snm></au></fm>
<bdy><sec><st>INTRODUCTION</st>
<p>Angelica is a European perennial plant sometimes grown in this
country as a culinary herb. This member of the parsley family,
related to carrots, grows in fields and damp places from Labrador
to Delaware and west to Minnesota. Syria is believed to be its
point of origin. One species (A. sylvestris) is called "Holy
Ghost."</p>
<ss1><st>Use</st>
<p>The fresh stems and leafstalks are used as garnish and for
making candied angelica. The seeds and the oil distilled from them
are used in flavoring foods, and the aromatic roots are used in
medicine. People in the north, particularly the Lapps, use it as a
foodstuff, condiment, or medicine, and even chew it like
tobacco.</p>
<ss2><st>Description</st>
<p>The robust growing angelica plant is 5-6 feet tall and resembles
wild carrot, although the leaves are much broader. It has large
petioles and a purple-colored root. Leaves are compound and
flowers are borne in umbels like the carrot. It is a perennial plant
that flowers every 2 years.</p>
<ss3><st>Culture</st>
<p>The plant thrives best in a moderately cool climate in
semishade; therefore, it is unlikely to grow well in Florida. The
plant is most readily propagated from division of old roots, which
can be set either in the fall or spring about 18 inches apart in
3-foot rows. If seeds can be obtained, start seedlings in a seed bed.
Then transplant to the garden.</p>
<top1><h>Harvesting.</h> <p>Roots, stems, and seeds are
harvested and used as needed, with some parts being ready 3-4
months after planting. Sometimes the roots of the first year plants
are dug, but usually the harvest of roots is deferred until fall of the
second year.</p></top1>
</ss3></ss2></ss1></sec></bdy></article>


Figure A-5. Instance of FCES Fact Sheet.
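
The validation of an instance against the model can be illustrated with a short Python sketch. It is not an SGML parser and it is not the validation software used in this project. It assumes fully tagged input with no attributes, entities, or tag minimization, and it supports only the ",", "|", "?", "*", and "+" operators that appear in Figure A-4. The element names "atl" and "top1" are this sketch's reading of the scanned figures, and the function names are hypothetical.

```python
import re

# Content models transcribed from Figure A-4 ("atl" and "top1" are assumed
# readings of the scanned figure).
DTD = {
    "article": "(fm, bdy)",
    "fm": "(tig, au*)",
    "tig": "(atl)",
    "atl": "(#PCDATA)",
    "au": "(snm)",
    "snm": "(#PCDATA)",
    "bdy": "(sec)+",
    "sec": "(st, (p|top1)*, ss1*)",
    "ss1": "(st, (p|top1)*, ss2*)",
    "ss2": "(st, (p|top1)*, ss3*)",
    "ss3": "(st, (p|top1)*)",
    "top1": "(h?, p+)",
    "h": "(#PCDATA)",
    "st": "(#PCDATA)",
    "p": "(#PCDATA)",
}

NAME = r"[A-Za-z][A-Za-z0-9.-]*"
TOKEN = re.compile(r"<(/?)(%s)>|([^<]+)" % NAME)


def model_to_regex(model):
    """Turn a content model into a regex over a string such as '<st><p><p>'."""
    wrapped = re.sub(NAME, lambda m: "(?:<%s>)" % re.escape(m.group(0)), model)
    # "," means "followed by", which is plain concatenation in a regex.
    return wrapped.replace(",", "").replace(" ", "")


def validate(instance, dtd=DTD):
    """Return a list of error messages; an empty list means no errors found."""
    errors = []
    stack = []  # open elements as (name, list of child element names)
    for close, name, text in TOKEN.findall(instance):
        if text:
            in_pcdata = stack and dtd.get(stack[-1][0]) == "(#PCDATA)"
            if text.strip() and not in_pcdata:
                errors.append("character data not allowed here: %r" % text.strip()[:20])
        elif not close:
            if name not in dtd:
                errors.append("undeclared element <%s>" % name)
            if stack:
                stack[-1][1].append(name)
            stack.append((name, []))
        else:
            if not stack or stack[-1][0] != name:
                errors.append("unexpected end-tag </%s>" % name)
                continue
            _, children = stack.pop()
            model = dtd.get(name)
            if model is None:
                continue
            if model == "(#PCDATA)":
                ok = not children
            else:
                sequence = "".join("<%s>" % child for child in children)
                ok = re.fullmatch(model_to_regex(model), sequence) is not None
            if not ok:
                errors.append("content of <%s> does not match %s" % (name, model))
    errors.extend("missing end-tag for <%s>" % name for name, _ in stack)
    return errors
```

Calling validate() on the text of Figure A-5 returns an empty list when every element's children match its content model. An instance that omits the fm element, for example, is reported as a mismatch against the (fm, bdy) model of article.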


APPENDIX  B 
PROCESS  AND  METHODOLOGY  FOR  DEVELOPING  AN  SGML  APPLICATION 


The following general procedure for developing an SGML
application was drawn from a basic SGML tutorial attended by
the author (Graphic Communications Association, 1991).

Recognizing  SGML  Applications 

The first step is to determine whether the project is a valid
candidate for an SGML application. Characteristics that could
warrant the use of SGML include multiple input sources, output
formats and devices, interchange requirements (portability),
content identification, volume requirements, and multipurpose
(multiproduct) databases.

Establish  goals  for  SGML  implementation 

The second step is to set appropriate goals for the
application; these goals drive the SGML analysis. Goals are the
basis for development of the model (DTD) and determine how it
is written. The goals directly affect publication system
design and might not be attainable with the institution's
current computer system, in which case a system upgrade may be
necessary. Institutions usually commit to SGML applications
because of the need to interchange, publish, or manage data.
Adopting SGML is generally a management decision, and often not
the choice of the editorial or publication staff.

Production staffs are primarily concerned with getting
the material out on time and may not understand or accept the
nature of the new SGML markup system. If the previous method
was working, production personnel may not understand why they
have to move to a new system.

Production staff may also feel threatened by the data
processing nature of SGML. SGML differs from other markup
languages in that it has a DTD, a specific set of rules
describing the structure of the documents. The DTD restricts
production personnel to placing tags in particular locations
and in a hierarchical order to achieve compliance. Production
staff may also be uneasy about the way SGML turns documents
into a format suitable for electronic display.

SGML can also make the work of production personnel more
difficult because of the highly structured, seemingly
inflexible process for creating and converting electronic
documents. Thus it is critical to keep the bigger picture in
view: identifying the data content, not just getting the
document edited and finished in the production stage.

One must review management goals during the initial phase
of developing an SGML application, then set both long-term and
short-term goals for the application. These goals are outlined
in a requirements document that requires management approval.
Upon approval, the requirements document is a useful reference
and evaluation tool during development of the SGML
application. It is important not to stray from the goals
outlined in the requirements document. If goals do change,
document the change and obtain management approval before
continuing development of the SGML application.

In summary, it is important to stay focused on the goals
of the SGML application. After management approval, refer to
the requirements document frequently to ensure that
development remains consistent with it. Be flexible during
development should some goals appear unattainable; however,
developers should never change the goals of an SGML
application on their own. Ensure that the group, particularly
management, reaches a consensus before proceeding with
changes. During the life of the application, ensure that there
are adequate personnel familiar with SGML and its terminology.
One must continually update and parse the DTD and document
instances as the project evolves. Finally, evaluate the
project by comparing the result against the requirements
document.

Form  a  working  group 

The third step is to form a working group to develop the
SGML model. The SGML model (DTD) can redefine user roles and
their respective turf in relation to others. When forming a
working group, one must therefore be prepared to deal with
strong egos, opposing points of view, and outright hostility,
because the model will affect the participants' jobs. A good
model (DTD) is correct in syntax and semantics, meets the needs
of and is acceptable to all user groups, and fosters a sense of
ownership among all the groups that will be affected.

Document  analysis 

The document analysis phase (fourth step) begins by
defining the required specifications of the document. An
example might be an in-house letter with company
specifications that require front matter, a body, and closing
matter as its content. Company standards could require the
date, addressee, and company title as the front matter. The
body contains the information sent to the addressee, while the
closing matter provides the author name, company title,
enclosures, etc.

After defining document specifications, the working group
collects a representative sample of documents for analysis.
At the initial stages, the working group should not attempt to
convert the content and structure of a document directly into
SGML syntax. Instead, they should develop a tree structure,
which allows non-SGML personnel greater participation and
reduces their initial confusion with unfamiliar terms. The
working group develops the tree structure by identifying the
data elements in the set of documents. An abbreviated name
(generic identifier) is assigned to each element for eventual
insertion into the model.

The tree structure begins with the major structural
elements of the document and increases in detail as the
document is broken into smaller pieces under each major
structural element. The document structure is broken down
until the actual textual information starts entering the tree
structure. The working group decides how far to break down the
structure during document analysis. For example, if one wishes
to search a database containing people with similar first and
last names, a separate element for the middle initial or name
would be vitally important for finding the correct person.
Tree structure complexity therefore depends on the goals of
the SGML application.

In  summary,  the  working  group  should: 

1)  Ensure  data  is  tagged  sufficiently  for  efficient 
database  storage  and  retrieval,  and  on-screen 
display. 

2)  Ensure  data  is  tagged  sufficiently  to  support 
printed  or  hardcopy  delivery. 

3)  Ensure  that  the  tagging  is  appropriate  for  either 
unassisted,  assisted,  or  automated  tag  entry. 

Create  the  Model  (DTD) 

The fifth step involves the development of a vocabulary
representation (DTD) of the tree structure developed in the
document analysis phase. The DTD (model) must conform to the
ISO 8879-1986 standard and begin with a DOCTYPE declaration,
which specifies the type of document (instance) that follows
the DTD (e.g., book, journal, or article). DTDs must be kept
as readable as possible to enable review by non-SGML
personnel.
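
The step from tree structure to DTD can be sketched as follows. This is a simplified illustration, not the procedure used to write the FCES model: it emits every group of children as a required sequence and every leaf as #PCDATA, so occurrence indicators (?, *, +) and alternatives still have to be added by hand. The letter tree and its generic identifiers are hypothetical.

```python
# Each node is (generic identifier, list of child nodes).
LETTER = ("letter", [
    ("fm", [("date", []), ("addr", []), ("cotitle", [])]),
    ("bdy", [("p", [])]),
    ("cm", [("au", []), ("cotitle", []), ("encl", [])]),
])


def declarations(node, seen=None):
    """Return one element declaration per distinct generic identifier."""
    seen = set() if seen is None else seen
    gi, children = node
    if gi in seen:  # an element reused elsewhere is declared only once
        return []
    seen.add(gi)
    if children:
        model = "(%s)" % ", ".join(child[0] for child in children)
    else:
        model = "(#PCDATA)"
    lines = ["<!ELEMENT %-8s - - %s>" % (gi, model)]
    for child in children:
        lines.extend(declarations(child, seen))
    return lines


def doctype(tree):
    """Wrap the declarations in a DOCTYPE declaration named after the root."""
    return "\n".join(["<!DOCTYPE %s [" % tree[0]] + declarations(tree) + ["]>"])
```

Printing doctype(LETTER) produces a first draft in the same layout as Figure A-4, with one declaration per line so that it remains readable for reviewers.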

Determine  the  SGML  Declaration 

The sixth step involves the writing of the SGML
Declaration. An SGML application uses rules defined in the
SGML Declaration, such as the syntax rules, delimiters, and
name length for generic identifiers. An SGML Declaration
specifies the following:

1) The SGML features used in a document.

2) The reference concrete syntax from the ISO 8879-1986
standard to use when tagging documents.

3) Any modifications to the syntax defined in the ISO
8879-1986 standard.

4) The character set that is allowable in a valid SGML
document.

The SGML Declaration provided in ISO 8879-1986 can be used as
an alternative to a custom declaration.
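
As a small illustration of how the declaration constrains the model, the sketch below checks generic identifiers against a name length limit and a naming rule. The default values assume the reference concrete syntax, in which NAMELEN is 8 and names consist of a letter followed by letters, digits, ".", or "-". The dictionary layout and the function name are hypothetical.

```python
import re

# Assumed values for the reference concrete syntax.
REFERENCE_SYNTAX = {
    "NAMELEN": 8,
    "name_pattern": r"[A-Za-z][A-Za-z0-9.\-]*",
}


def check_generic_identifiers(names, syntax=REFERENCE_SYNTAX):
    """Map each offending name to the list of rules it breaks."""
    problems = {}
    for name in names:
        issues = []
        if not re.fullmatch(syntax["name_pattern"], name):
            issues.append("not a valid name")
        if len(name) > syntax["NAMELEN"]:
            issues.append("longer than NAMELEN=%d" % syntax["NAMELEN"])
        if issues:
            problems[name] = issues
    return problems
```

Short identifiers such as article, fm, and ss1 pass under these assumptions. A longer identifier such as subsection1 would require a custom declaration with a larger NAMELEN, which can be simulated by passing a modified copy of the dictionary.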

SGML  Conformance 

The seventh step involves SGML conformance, which covers
the conformance of documents, applications, and systems
(i.e., from the parser to the final product). A conforming
SGML application (system) consists of:

1) Conforming documents and relevant documentation.

2) Allowable application conventions.

3) Conformance to the system declaration.

4) Support for reference concrete syntax and reference
capacity set.

5) Document parsing facilities for all applications.

6) No enforcement of application conventions as though
part of the ISO standard.

SGML  Validation 

The  eighth  step  requires  an  SGML  parser,  DTD,  and  SGML 
instance.  An  SGML  parser  is  the  system  validator,  and  an  SGML 
Document  Type  Definition  (DTD)  is  the  model  created  for  the 
system  application.  The  SGML  document  is  the  instance  of  the 
SGML  application  developed. 

A parser reads the SGML Declaration, decides if it is
syntactically correct, and learns the rules of the SGML
application. Then the parser reads in the DTD and decides if
it is syntactically correct based on the SGML Declaration. If
the DTD is syntactically correct (validated), the parser
retrieves the document instance (tagged text file) and
compares the syntax and delimiters with those in the SGML
Declaration. In the final step the parser verifies that the
markup in the instance conforms to the DTD model. An SGML
parser must know the rules in ISO 8879-1986, support the
reference concrete syntax, parse any DTD by itself, and parse
any instance against a DTD. The parser must also find and
report markup errors, and must not report nonexistent errors.
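
The order of these stages can be illustrated with a simplified Python sketch. The three checks are deliberately minimal stand-ins written for this example and cover only a fraction of what a real SGML parser verifies. The point is the control flow: each stage runs only if the previous one reported no errors. All function names and data layouts are hypothetical.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9.-]*"


def check_declaration(decl):
    """Stage 1: the declaration must define a usable name length."""
    if decl.get("NAMELEN", 0) > 0:
        return []
    return ["NAMELEN must be a positive number"]


def check_dtd(dtd, decl):
    """Stage 2: the DTD is checked against the rules in the declaration."""
    errors = []
    for gi, model in dtd.items():
        if len(gi) > decl["NAMELEN"]:
            errors.append("%s exceeds NAMELEN" % gi)
        for ref in re.findall(NAME, model.replace("#PCDATA", "")):
            if ref not in dtd:
                errors.append("%s refers to undeclared element %s" % (gi, ref))
    return errors


def check_instance(instance, dtd):
    """Stage 3: every tag in the instance must name a declared element."""
    used = set(re.findall(r"</?(%s)>" % NAME, instance))
    return ["undeclared element <%s>" % gi for gi in sorted(used - set(dtd))]


def parse(decl, dtd, instance):
    """Run the stages in order and stop at the first one that reports errors."""
    stages = [
        ("SGML Declaration", lambda: check_declaration(decl)),
        ("DTD", lambda: check_dtd(dtd, decl)),
        ("instance", lambda: check_instance(instance, dtd)),
    ]
    report = []
    for label, check in stages:
        errors = check()
        report.append((label, errors))
        if errors:
            break  # later stages depend on this one being correct
    return report
```

With a declaration of {"NAMELEN": 8}, a three-element memo DTD, and a matching instance, parse() returns three stages with empty error lists. If the DTD refers to an element it does not declare, the report ends at the DTD stage and the instance is never examined.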

Document  The  System  That  Was  Created/Train  It/Maintain  It 

The ninth step includes a documentation requirements
phase, which must accompany any validation process. The
required documentation must appear at the front of all printed
publications, in on-screen computer displays, and in all
training and promotional literature.

System  Declaration 

A  system  declaration  describes  the  SGML  system.  Its  form 
is  very  similar  to  the  SGML  Declaration,  which  defines  such 
areas  as  the  document  features,  syntax,  variable  name  length, 
and  maximum  quantities.  The  system  declaration  gives 
information  on  what  the  system  can  support.  It  must  meet  the 
same  syntax  requirements  as  the  SGML  Declaration. 


APPENDIX  C 
SELECTED  PUBLICATIONS 


Fifty FCES publications were chosen as a representative
sample set. The documents were selected from volumes one and
two on FAIRS DISCS. Only documents created after June 1, 1993
were selected to ensure that most publications were tagged
with FAST-WP. Each filename begins with two letters that
identify the heading under which the publication can be found.
The five-digit number following the two letters is the actual
document number. The filenames and titles of the randomly
selected publications are listed below.
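
The filename convention just described (two letters followed by a five-digit document number) can be checked mechanically. The following Python sketch is an illustration only; the function names are hypothetical, and no meaning is assumed for the individual heading codes.

```python
import re

# Two uppercase letters (heading) followed by a five-digit document number.
FILENAME = re.compile(r"([A-Z]{2})(\d{5})")


def parse_filename(filename):
    """Split a filename into (heading letters, document number)."""
    match = FILENAME.fullmatch(filename)
    if match is None:
        raise ValueError("not an FCES filename: %r" % filename)
    return match.group(1), match.group(2)


def group_by_heading(filenames):
    """Collect document numbers under their two-letter heading."""
    groups = {}
    for filename in filenames:
        heading, number = parse_filename(filename)
        groups.setdefault(heading, []).append(number)
    return groups
```

For example, parse_filename("AC01200") separates the heading "AC" from the document number "01200", and group_by_heading() can be used to count how many sampled publications fall under each heading.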


Filename   Title

AC01200    Alternative Opportunities for Small Farms:
           Christmas Tree Production
AC02300    Alternative Opportunities for Small Farms:
           Pumpkin Production Review
AC02900    Alternative Opportunities for Small Farms:
           Watermelon Production Review
AE03100    Microirrigation in Florida: Systems, Acreage and
           Costs
AG00300    Labelled Aquatic Sites for Specific Herbicides
AG01600    Biological Control with Insects: The Hydrilla
           Stem Weevil
AG01800    Biological Control with Insects: The
           Waterhyacinth Moth
AS05800    Limited Certification of Governmental and
           Private Applicators
AS07700    Florida Pesticide Law and Rules (Florida
           Statutes - Chapter 487): Restrictions on the
           Use of Methyl Bromide
AS09900    Agricultural Safety and Health Educational
           Materials Directory
AS11200    Farm Respiratory Hazards
DS12100    Water Budgets for Florida Dairy Farms
HE08300    Cradle Crier: Your Child's Development During
           Month Nine
HE14500    Ayude a Un Niño Después de Un Huracán: Lo Que Debe
           Saber
HE15400    Making Financial Plans Together
HE15900    Banking Your Dollars
HE16400    Health Insurance
HE18400    Selecting, Preparing, and Canning: Country
           Western Ketchup
HE18600    Selecting, Preparing, and Canning: Chile Salsa
           (Hot Tomato-Pepper Sauce)
HE18700    Desarrollo del Niño: Primer Mes
HE21400    Home Canning: Canning Fruit-Based Baby Foods
HE22800    Selecting, Preparing, and Canning: Figs
HE23400


APPENDIX D
TREE STRUCTURE


This appendix provides tree diagrams describing the
structural properties of elements in the sample set of FCES
publications. The tree diagrams were developed from document
analysis, the AAP Article model, and some unique elements.


APPENDIX E
MODEL ELEMENTS, ATTRIBUTES, AND ENTITIES


This appendix provides a comprehensive list of the
elements, attributes, and entities in the FCES model. Each
generic identifier (name) is either an AAP or University of
Florida element tag. Descriptions are provided below for each
element, attribute, and entity.


APPENDIX F
STRUCTURE OF THE POPUP MENU USED FOR FAST-WP


An electronic toolkit, known as FAST-WP, was developed at
the University of Florida to make it easy for authors and word
processors to add special codes to WordPerfect 5.1
(WordPerfect Corporation, 1993a) files running under DOS
(Cilley and Watson, 1992a and 1992b). The special codes,
WordPerfect styles with generic stylenames, were used to
define structural areas within FCES publications. The styles
were placed in FCES publications via a pop-up menu. This
appendix provides a graphic of the contents of the popup menu.


APPENDIX G
RELATIONSHIP BETWEEN STYLES AND ELEMENTS


This appendix provides a table describing the
relationships between the model elements and the generic
styles that were placed in FCES publications using FAST-WP
authoring tools.